Hadoop system principle
1. Introduction to Hadoop

Hadoop is an open source software framework under Apache and a platform for developing and running large-scale data-processing applications in Java. It allows large data sets to be processed in a distributed way across large clusters of computers using a simple programming model.

Hadoop in a narrow sense refers to Apache Hadoop, the open source framework itself. Its core components are:

HDFS (Hadoop Distributed File System): solves massive data storage.

YARN (job scheduling and cluster resource management framework): solves resource and task scheduling.

MapReduce (distributed computing programming framework): solves massive data computation (a minimal job sketch follows below).
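To make the "simple programming model" concrete, here is a minimal sketch of the classic word-count job written against the standard org.apache.hadoop.mapreduce API; the input and output paths are taken from the command line and are purely illustrative.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: split each input line into words and emit (word, 1)
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory on HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Such a job would typically be packaged into a jar and submitted with "hadoop jar wordcount.jar WordCount /input /output": HDFS supplies the input blocks, YARN schedules the map and reduce containers, and the map tasks run in parallel on the nodes that hold the data.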

Broadly speaking, Hadoop usually refers to a wider concept: the Hadoop ecosystem.

At present, Hadoop has grown into a huge ecosystem. As the ecosystem grows, more and more new projects appear, including some non-Apache projects, which complement Hadoop well or provide higher-level abstractions over it.

2. Characteristics of Hadoop

Scalable: Hadoop distributes data and computing tasks across the available computer clusters, and these clusters can easily be extended to thousands of nodes.

Economical: Hadoop distributes and processes data on server clusters built from ordinary, inexpensive machines, so the cost is very low.

Efficient: Hadoop can move data dynamically and in parallel between nodes, so processing is very fast.

Reliable: Hadoop automatically maintains multiple copies of data and automatically redeploys computing tasks after a failure, so it can be trusted to store and process data bit by bit (a small sketch of working with these copies follows below).
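As a small illustration of how an application hands data to HDFS and controls how many copies are kept, here is a hedged Java sketch using the standard org.apache.hadoop.fs.FileSystem API; the path /demo/hello.txt and the replication factors are made up for the example, and a reachable HDFS cluster configuration on the classpath is assumed.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // dfs.replication controls how many copies HDFS keeps of each block (3 by default)
        conf.set("dfs.replication", "3");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/demo/hello.txt");   // illustrative path
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("hello hadoop");
        }

        // The replication factor of an existing file can also be changed afterwards
        fs.setReplication(file, (short) 2);
        fs.close();
    }
}

With a replication factor of 3, the loss of a single DataNode does not lose data; the NameNode notices the missing copies and re-replicates the affected blocks onto other nodes.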

3. Version history

1.x version series: the second generation of open source Hadoop, which mainly fixed bugs in the 0.x versions and has since been phased out.

2.x version series: the architecture changed significantly and many new features such as the YARN platform were introduced; this is the mainstream version series currently in use.

3.x version series: HDFS, MapReduce, and YARN were upgraded substantially, and Ozone key-value storage was added.

4. Introduction to Hadoop architecture and models

Hadoop 2.0 was developed against JDK 1.7, and updates for JDK 1.7 stopped in April 2015, which directly pushed the Hadoop community to release a new version of Hadoop based on JDK 1.8, namely Hadoop 3.0. Hadoop 3.0 introduces a number of important features and optimizations, including HDFS erasure coding, support for multiple NameNodes, native task optimization for MapReduce, cgroup-based isolation of memory and disk IO in YARN, YARN container resizing, and so on.
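For example, the erasure coding mentioned above is typically enabled per directory with the hdfs ec subcommand that ships with Hadoop 3.x; the directory /data/cold below is only illustrative.

hdfs ec -listPolicies                                         # show built-in policies (RS-6-3-1024k is the default)
hdfs ec -enablePolicy -policy RS-6-3-1024k                    # enable the policy on the cluster
hdfs ec -setPolicy -path /data/cold -policy RS-6-3-1024k      # apply it to a directory
hdfs ec -getPolicy -path /data/cold                           # verify which policy is in effect

Compared with plain 3x replication, a Reed-Solomon 6+3 layout stores the same data with roughly 1.5x overhead while still tolerating the loss of up to three blocks in a stripe, which is why erasure coding is attractive for cold data.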

According to the latest news from the Apache Hadoop project team, the Hadoop 3.x line will further adjust its architecture and use MapReduce that combines memory, IO, and disk to process data together. The biggest change is in HDFS: each computation is done against the nearest block. Following the principle of computing close to the data, local blocks are loaded into memory and computed first, IO shares the in-memory computation area, and the final result is produced quickly; this is claimed to be 10 times faster than Spark.