Borg thesis
(1) Hadoop 1.0

The first generation of Hadoop consists of the distributed storage system HDFS and the distributed computing framework MapReduce. HDFS consists of one NameNode and several DataNodes; MapReduce consists of one JobTracker and several TaskTrackers. The corresponding Hadoop versions are Hadoop 1.x, 0.21.x, and 0.22.x.

(2) Hadoop 2.0

The second generation of Hadoop was proposed to overcome problems in the HDFS and MapReduce of Hadoop 1.0. Because the single NameNode in Hadoop 1.0 limits how far HDFS can scale, HDFS Federation was introduced: it lets multiple NameNodes manage different directories, which provides access isolation and horizontal scaling. Because MapReduce in Hadoop 1.0 falls short in scalability and in support for multiple frameworks, a brand-new resource management framework, YARN (Yet Another Resource Negotiator), was introduced. YARN splits the JobTracker's two functions, resource management and job control, into separate components: the ResourceManager, which allocates resources for all applications, and the ApplicationMaster, which manages only a single application. The corresponding Hadoop versions are Hadoop 0.23.x and 2.x.
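The idea behind HDFS Federation, that different directories are managed by different NameNodes, can be sketched as a longest-prefix lookup. This is only an illustration: the mount table, the NameNode names, and the function `resolve_namenode` are hypothetical and are not part of any Hadoop API.

```python
# Hypothetical mount table: directory prefix -> NameNode that manages it.
MOUNT_TABLE = {
    "/user": "namenode-1",
    "/data": "namenode-2",
    "/data/logs": "namenode-3",
}


def resolve_namenode(path, mount_table=MOUNT_TABLE):
    """Return the NameNode responsible for `path` (longest matching prefix wins)."""
    parts = [p for p in path.split("/") if p]
    # Try the full path first, then progressively shorter prefixes.
    for end in range(len(parts), 0, -1):
        prefix = "/" + "/".join(parts[:end])
        if prefix in mount_table:
            return mount_table[prefix]
    raise KeyError(f"no NameNode mounted for {path!r}")


print(resolve_namenode("/user/alice/report.txt"))
print(resolve_namenode("/data/raw/part-0000"))
print(resolve_namenode("/data/logs/2024/app.log"))
```

Because each prefix belongs to exactly one NameNode, adding a NameNode for a new directory does not affect the others, which is the horizontal scaling and isolation property described above.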

(3) MapReduce 1.0, or MRv1 (MapReduce version 1)

The first-generation MapReduce computing framework consists of two parts: a programming model and a runtime environment. The programming model abstracts a problem into two stages, Map and Reduce. In the Map stage, the input data is parsed into key/value pairs, the map() function is called once for each pair, and the results are written to a local directory as key/value pairs. In the Reduce stage, values that share the same key are reduced (aggregated), and the final result is written to HDFS. The runtime environment consists of two kinds of services, JobTracker and TaskTracker: the JobTracker is responsible for resource management and for controlling all jobs, and each TaskTracker receives and executes commands from the JobTracker.
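The two-stage programming model can be illustrated with a word count that runs in a single process. This is a minimal sketch of the model only, not Hadoop's API: `map_fn`, `reduce_fn`, and `run_job` are hypothetical names, and the grouping step stands in for the shuffle that the real runtime performs between the two stages.

```python
from collections import defaultdict


def map_fn(_key, line):
    # Map: emit (word, 1) for every word in one input line.
    for word in line.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    # Reduce: combine all values that share the same key.
    return key, sum(values)


def run_job(records, map_fn, reduce_fn):
    # Map stage: call map_fn once per input key/value pair.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))
    # Group intermediate values by key (the role of the shuffle).
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce stage: call reduce_fn once per distinct key.
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))


lines = ["Hadoop runs MapReduce", "YARN runs MapReduce and Spark"]
records = list(enumerate(lines))  # (line number, line text) pairs
counts = run_job(records, map_fn, reduce_fn)
print(counts)
```

In a real job the map output would be written to local disk on each worker and the reduce output to HDFS; here both simply stay in memory.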

(4) MapReduce 2.0, or MRv2 (MapReduce version 2), or NextGen MapReduce

MapReduce 2.0 (MRv2) has the same programming model as MRv1; the only difference is the runtime environment. MRv2 is MRv1 reworked to run on the resource management framework YARN. Its runtime no longer consists of a JobTracker and TaskTrackers. Instead there is a job control process, the ApplicationMaster, which manages only one job, while resource management is handled by YARN.

In short, MRv1 is a standalone offline computing framework, while MRv2 is MRv1 running on YARN.
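The division of labor in MRv2, one cluster-wide resource allocator plus one job controller per job, can be sketched as a toy model. The classes below only borrow the names ResourceManager and ApplicationMaster; their methods, the container counts, and the scheduling loop are assumptions made for illustration and do not reflect YARN's actual interfaces or protocol.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceManager:
    """Toy cluster-wide allocator shared by all applications."""
    free_containers: int
    granted: dict = field(default_factory=dict)  # app_id -> containers held

    def allocate(self, app_id, requested):
        # Grant as many containers as are free, up to the request.
        grant = min(requested, self.free_containers)
        self.free_containers -= grant
        self.granted[app_id] = self.granted.get(app_id, 0) + grant
        return grant

    def release(self, app_id):
        # Return every container held by the given application.
        self.free_containers += self.granted.pop(app_id, 0)


@dataclass
class ApplicationMaster:
    """Toy per-job controller: manages the tasks of exactly one job."""
    app_id: str
    tasks: list

    def run(self, rm):
        results, pending = [], list(self.tasks)
        while pending:
            grant = rm.allocate(self.app_id, len(pending))
            if grant == 0:
                raise RuntimeError("no containers available")
            batch, pending = pending[:grant], pending[grant:]
            results.extend(task() for task in batch)  # run one wave of tasks
            rm.release(self.app_id)
        return results


rm = ResourceManager(free_containers=2)
am_a = ApplicationMaster("job-a", [lambda: 1, lambda: 2, lambda: 3])
am_b = ApplicationMaster("job-b", [lambda: "x"])
results_a = am_a.run(rm)
results_b = am_b.run(rm)
print(results_a, results_b, rm.free_containers)
```

Neither ApplicationMaster knows about the other job; only the ResourceManager sees the whole cluster, which mirrors the separation of job control from resource management described above.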

(5) Hadoop-MapReduce (an offline computing framework)

Hadoop is an open-source implementation of Google's distributed computing framework MapReduce and distributed storage system GFS. It consists of MapReduce, a distributed computing framework, and HDFS (Hadoop Distributed File System), a distributed storage system. It offers high fault tolerance, high scalability, and a simple programming interface, and it has been adopted by many Internet companies.

(6) Hadoop-YARN (a branch of Hadoop 2.0, in practice a resource management system)

YARN is a sub-project of Hadoop (alongside MapReduce). It is in practice a unified resource management system on which various computing frameworks can run, including MapReduce, Spark, Storm, and MPI.

Hadoop version numbering is confusing and leaves many users at a loss. In practice there are only two generations of Hadoop: Hadoop 1.0 and Hadoop 2.0. Hadoop 1.0 consists of the distributed file system HDFS and the offline computing framework MapReduce. Hadoop 2.0 consists of an HDFS that supports horizontal scaling of NameNodes, the resource management system YARN, and the offline computing framework MapReduce running on YARN. Compared with Hadoop 1.0, Hadoop 2.0 is more capable, scales and performs better, and supports multiple computing frameworks.

Systems such as Borg, YARN, Mesos, Torca, and Corona let a company build an internal ecosystem in which all applications and services run side by side in a "peaceful and friendly" way. With such a system you no longer have to agonize over which Hadoop version to use (Hadoop 0.20.2 or Hadoop 1.0, say) or which computing model to choose: all software versions and computing models can run together on one "supercomputer".

From an open-source point of view, the arrival of YARN has to some extent defused the debate over which computing framework is better. YARN evolved from Hadoop MapReduce. In the MapReduce era, many people criticized MapReduce as unsuitable for iterative computing and stream computing, which led to computing frameworks such as Spark and Storm. The developers of these systems compared them with MapReduce on their websites and in their papers, promoting how advanced and efficient their own systems were. After YARN appeared, the picture became clearer: MapReduce is just one kind of application running on YARN, and Spark and Storm are essentially the same. Each is developed for a different type of application, and each has its own strengths and weaknesses. Moreover, barring surprises, future computing frameworks can be expected to be developed on YARN. In this way, an ecosystem was born with YARN as the underlying resource management platform and various computing frameworks running on top of it.

At present, Spark is a very popular in-memory computing framework (also described as an iterative or DAG computing framework). With MapReduce widely criticized for its inefficiency, the arrival of Spark was a breath of fresh air.

From the point of view of architecture and application, Spark is a development library that contains only computational logic. It does ship standalone master/slave services, but these are usually not used because of concerns about stability and about integration with other types of jobs. Apart from that, it contains no implementation related to resource management and scheduling, which lets Spark run flexibly on the mainstream resource management systems. The typical ones are Mesos and YARN, and the resulting deployments are called "Spark on Mesos" and "Spark on YARN". Running Spark on a resource management system brings several benefits, including sharing cluster resources with other computing frameworks and allocating resources on demand, which improves cluster resource utilization.
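The utilization benefit of on-demand allocation can be shown with a small arithmetic sketch. The demand figures, the 100-slot capacity, and both function names are made up for illustration; the comparison is between giving each framework a fixed equal partition and letting all frameworks draw from one shared pool.

```python
def static_utilization(demands, capacity):
    """Each framework owns a fixed, equal partition of the cluster."""
    share = capacity // len(demands)
    used = sum(min(demand, share) for demand in demands.values())
    return used / capacity


def shared_utilization(demands, capacity):
    """All frameworks draw on demand from one shared pool."""
    used = min(sum(demands.values()), capacity)
    return used / capacity


# Hypothetical slot demands at one moment in time.
demands = {"spark": 80, "mapreduce": 20}
static = static_utilization(demands, capacity=100)
shared = shared_utilization(demands, capacity=100)
print(static, shared)
```

With fixed partitions, Spark is capped at 50 slots while 30 of MapReduce's slots sit idle; with a shared pool, the idle slots can be handed to whichever framework currently needs them.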

Frameworks on YARN

Frameworks that run on YARN include MapReduce-on-YARN, Spark-on-YARN, Storm-on-YARN, and Tez-on-YARN.

(1) MapReduce-on-YARN: offline computing on YARN;

(2) Spark-on-YARN: in-memory computing on YARN;

(3) Storm-on-YARN: real-time/stream computing on YARN;

(4) Tez-on-YARN: DAG computing on YARN.