1. Big data refers to a collection of data whose contents cannot be captured, managed, and processed by conventional software tools within an acceptable time. In short, the volume of data is too large for traditional tools such as relational databases and data warehouses. What order of magnitude is "big" here? For example, Alibaba processes 20 PB (20,971,520 GB) of data every day.
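For reference, a one-line sketch of that conversion (1 PB = 1024 × 1024 GB):
# 20 PB expressed in GB
echo $((20 * 1024 * 1024))    # prints 20971520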
2. The characteristics of big data:
(1) Huge volume. On the current trend, the volume of big data has reached the PB or even EB level.
(2) Diverse data types, mainly unstructured data, such as online magazines, audio, video, pictures, geographic location information, transaction data, social data, and so on.
(3) Low value density. Valuable data accounts for only a small part of the total. For example, in a video, only a few seconds of footage may be valuable.
(4) Fast generation and fast processing demands. This is the most striking difference between big data and traditional data mining.
3. Systems that can process big data:
Hadoop (open source)
Spark (open source)
Storm (open source)
MongoDB (open source)
IBM PureData (commercial)
Oracle Exadata Database Machine (commercial)
SAP HANA (commercial)
Teradata Aster Data (commercial)
EMC Greenplum (commercial)
HP Vertica (commercial)
Note: Only Hadoop is introduced here.
Two: Hadoop architecture
1. The origin of Hadoop:
Hadoop originated from three papers published by Google between 2003 and 2006, on GFS (the Google File System), MapReduce, and BigTable, and was created by Doug Cutting. Hadoop is now a top-level project of the Apache Foundation.
Hadoop is a made-up name: Doug Cutting named it after his son's yellow toy elephant.
2. The core of Hadoop:
(1) HDFS and MapReduce are the two cores of Hadoop. HDFS provides the underlying support for distributed storage, enabling high-speed parallel reads and writes and large-capacity storage expansion.
(2) MapReduce provides support for distributed tasks, ensuring high-speed partitioned processing of data.
3. Hadoop subprojects:
(1) HDFS: the distributed file system, the cornerstone of the whole Hadoop system.
(2) MapReduce/YARN: the parallel programming model. YARN is the second-generation MapReduce framework. Since Hadoop 0.23.0, MapReduce has been rebuilt; the new version is usually called MapReduce V2, and the old MapReduce is also called MapReduce V1.
(3) Hive: a data warehouse built on Hadoop, which provides an SQL-like query language to query data in Hadoop.
(5) HBase: the Hadoop database, a distributed, column-oriented database derived from Google's BigTable paper, mainly used for random access and real-time reading and writing of big data.
(6) ZooKeeper: a coordination service designed for distributed applications. It mainly provides synchronization, configuration management, grouping, and naming services, reducing the coordination tasks that distributed applications must undertake themselves.
There are many other projects, which will not be explained here.
Three: Install the Hadoop running environment
1. Create a user:
(1) Create a hadoop user group. Enter the command:
groupadd hadoop
(2) Create the user hduser. Enter the command:
useradd -g hadoop hduser
(3) Set the password of hduser. Enter the command:
passwd hduser
Enter the password twice as prompted.
(4) Add permissions for hduser. Enter the commands:
# Modify permissions
chmod 777 /etc/sudoers
# Edit sudoers
gedit /etc/sudoers
# Restore default permissions
chmod 440 /etc/sudoers
First, modify the permissions on the sudoers file. Then, in the text editor, find the line "root ALL=(ALL) ALL" and add the line "hduser ALL=(ALL) ALL" below it to add hduser to sudoers. Remember to restore the default permissions afterward, otherwise the sudo command will not be allowed.
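After the edit, the relevant lines of /etc/sudoers should look roughly like this (a sketch; the surrounding contents vary by distribution):
root    ALL=(ALL)       ALL
hduser  ALL=(ALL)       ALL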
(5) After the setup is complete, restart the virtual machine. Enter the command:
sudo reboot
Switch to hduser login after reboot.
2. Install the JDK:
(1) Download jdk-7u67-linux-x64.rpm and enter the download directory.
(2) Run the installation command:
sudo rpm -ivh jdk-7u67-linux-x64.rpm
When finished, check the installation path. Enter the command:
rpm -ql jdk
Remember this path.
(3) Configure environment variables. Enter the command:
sudo gedit /etc/profile
Open the profile and add the following at the bottom of the file:
export JAVA_HOME=/usr/java/jdk1.7.0_67
export CLASSPATH=$JAVA_HOME/lib:$CLASSPATH
export PATH=$JAVA_HOME/bin:$PATH
Close the file after saving, then enter the command to make the environment variables take effect:
source /etc/profile
(4) Verify the JDK. Enter the command:
java -version
If the correct version appears, the installation is successful.
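For this package, the first line printed should look roughly like the following (build details may vary):
java version "1.7.0_67"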
3. Configure password-free SSH login for localhost:
(1) Use ssh-keygen to generate the private and public key files. Enter the command:
ssh-keygen -t rsa
(2) The private key stays on this machine, and the public key is sent to other hosts (for now, localhost). Enter the command:
ssh-copy-id localhost
(3) Log in using the public key. Enter the command:
ssh localhost
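For context, ssh-copy-id is roughly equivalent to the following manual steps (a sketch, using the default file names produced by ssh-keygen -t rsa):
# Append the local public key to the remote authorized_keys file
cat ~/.ssh/id_rsa.pub | ssh localhost 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys'
# Tighten permissions so sshd will accept the key
ssh localhost 'chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys'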
4. Configure password-free SSH login for the other hosts:
(1) Clone the virtual machine twice. Select the virtual machine in the left column of VMware, right-click, and choose Manage > Clone from the shortcut menu. For the clone type, select "Create a full clone", then click "Next" until it finishes.
(2) Start each of the three virtual machines and use ifconfig to query the IP address of each host.
(3) Modify the host name and hosts file of each host.
Step 1: Modify the host name. Enter the command on each host:
sudo gedit /etc/sysconfig/network
Step 2: Modify the hosts file:
sudo gedit /etc/hosts
Step 3: Set the IP addresses of the three virtual machines:
The first virtual machine, node1, uses IP 192.168.1.130.
The second virtual machine, node2, uses IP 192.168.1.131.
The third virtual machine, node3, uses IP 192.168.1.132.
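With those addresses, the lines added to /etc/hosts on each machine would look like this (a sketch; use the addresses ifconfig actually reports):
192.168.1.130 node1
192.168.1.131 node2
192.168.1.132 node3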
(4) The key pair has already been generated on node1, so all you need to do now is enter the commands on node1:
ssh-copy-id node2
ssh-copy-id node3
In this way, node1's public key is published to node2 and node3.
(5) Test SSH. Enter the commands on node1:
ssh node2
# Log out
exit
ssh node3
exit
Four: Hadoop fully distributed installation
1. Hadoop has three modes of operation:
(1) Stand-alone mode: Hadoop runs as a single Java process in non-distributed mode, without any configuration.
(2) Pseudo-distributed: a cluster with only one node, which acts as both master (master node, master server) and slave (slave node, slave server). Different Java processes on this node simulate the various nodes of a distributed system.
(3) Fully distributed: the cluster spans multiple hosts; different deployments divide the node roles among the hosts in different ways.
2. Install Hadoop:
(1) Get the Hadoop package hadoop-2.6.0.tar.gz. After downloading it, use a VMware Tools shared folder, or a tool such as Xftp, to send it to node1. On node1, extract the package into the /home/hduser directory. Enter the commands:
# Enter the home directory, i.e. /home/hduser
cd ~
tar -zxvf hadoop-2.6.0.tar.gz
(2) Rename the directory to hadoop. Enter the command:
mv hadoop-2.6.0 hadoop
(3) Configure the Hadoop environment variables. Enter the command:
sudo gedit /etc/profile
Add the following script to the configuration file:
#hadoop
export HADOOP_HOME=/home/hduser/hadoop
export PATH=$HADOOP_HOME/bin:$PATH
Save and close, and finally enter the command to make the configuration take effect:
source /etc/profile
Note: node2 and node3 must be configured in the same way.
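A quick check that the variables took effect (assuming the PATH entry above):
hadoop version    # the first line should report Hadoop 2.6.0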
3. Configure Hadoop:
(1) The hadoop-env.sh file is used to specify the JDK path. Enter the commands:
[hduser@node1 ~]$ cd ~/hadoop/etc/hadoop
[hduser@node1 hadoop]$ gedit hadoop-env.sh
Then add the following line to specify the JDK path:
export JAVA_HOME=/usr/java/jdk1.7.0_67
(2) The yarn-env.sh file also specifies the JDK path. Open it and add the same line:
export JAVA_HOME=/usr/java/jdk1.7.0_67
(3) The slaves file lists the slave nodes. Open it and replace its contents with:
node2
node3
(4) core-site.xml: This file is the Hadoop global configuration. Open it and add the following properties to the <configuration> element:
<property>
<name>fs.defaultFS</name>
<value>hdfs://node1:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>file:/home/hduser/hadoop/tmp</value>
</property>
These are the two most commonly used properties: fs.defaultFS is the default path prefix clients use when connecting to HDFS, and 9000 is the HDFS working port. If hadoop.tmp.dir is not specified, data is saved to the system default temporary directory /tmp.
(5) hdfs-site.xml: This file is the HDFS configuration. Open it and add the configuration properties to the <configuration> element.
(6) mapred-site.xml: This file is the MapReduce configuration. It can be created by copying the template file mapred-site.xml.template; open it and add the properties to the <configuration> element.
(7) yarn-site.xml: If the yarn framework is configured in mapred-site.xml, the yarn framework uses the configuration in this file. Open it and add the configuration properties to the <configuration> element. A sketch of typical property bodies for steps (5) through (7) is given at the end of this article.
(8) Copy these seven configuration files to node2 and node3. Enter the commands:
scp -r /home/hduser/hadoop/etc/hadoop/ hduser@node2:/home/hduser/hadoop/etc/
scp -r /home/hduser/hadoop/etc/hadoop/ hduser@node3:/home/hduser/hadoop/etc/
4. Verification:
Let's verify whether Hadoop is configured correctly.
(1) Format the NameNode on the master host (node1). Enter the commands:
[hduser@node1 ~]$ cd ~/hadoop
[hduser@node1 hadoop]$ bin/hdfs namenode -format
(2) Close the system firewall on node1, node2, and node3, then restart the virtual machines. Enter the commands:
service iptables stop
chkconfig iptables off
reboot
(3) Enter the following to start HDFS:
[hduser@node1 ~]$ cd ~/hadoop
(4) Start all services:
[hduser@node1 hadoop]$ sbin/start-all.sh
(5) View the cluster status:
[hduser@node1 hadoop]$ bin/hdfs dfsadmin -report
(6) View the running status of HDFS in a browser: http://node1:50070
(7) Stop Hadoop. Enter the command:
[hduser@node1 hadoop]$ sbin/stop-all.sh
Five: Hadoop-related shell operations
(1) Create file1.txt and file2.txt in the /home/hduser/file directory of the operating system; you can use the graphical interface to create them.
file1.txt contents: Hello World hi HADOOP
file2.txt contents: Hello World hi CHINA
(2) Start HDFS and create the directory /input2:
[hduser@node1 hadoop]$ bin/hadoop fs -mkdir /input2
(3) Save file1.txt and file2.txt into HDFS:
[hduser@node1 hadoop]$ bin/hadoop fs -put ~/file/file*.txt /input2/
(4) [hduser@node1 …
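As mentioned in step (7) of the configuration section, here is a minimal sketch of typical <configuration> bodies for hdfs-site.xml, mapred-site.xml, and yarn-site.xml, assuming the node1/node2/node3 layout above. All values are illustrative assumptions, not taken from this tutorial; adjust paths, host names, and the replication factor to your own cluster.
hdfs-site.xml:
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/home/hduser/hadoop/dfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/home/hduser/hadoop/dfs/data</value>
</property>
<property>
<!-- Two datanodes (node2, node3), so at most 2 replicas -->
<name>dfs.replication</name>
<value>2</value>
</property>
mapred-site.xml:
<property>
<!-- Run MapReduce jobs on the YARN framework -->
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
yarn-site.xml:
<property>
<!-- The ResourceManager runs on the master host -->
<name>yarn.resourcemanager.hostname</name>
<value>node1</value>
</property>
<property>
<!-- Auxiliary shuffle service required by MapReduce -->
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>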