What is Hadoop for big data analysis?
To understand what Hadoop is, we must first understand the problems with big data and with traditional processing systems. Then we will discuss what Hadoop is and how it solves those big data problems. We will also look at the CERN case study to highlight the benefits of using Hadoop.

In the previous blog, "Big Data Tutorial", we discussed big data and its challenges in detail. In this blog, we will discuss:

1. The problems of traditional methods

2. The development of Hadoop

3. What is Hadoop?

4. Hadoop as the solution

5. When to use Hadoop?

6. When not to use Hadoop?

First, the case study of CERN.

Big data is becoming an opportunity for organizations. Now, organizations have realized that they can get many benefits from big data analysis, as shown in the following figure. They are examining large data sets to discover all hidden patterns, unknown correlations, market trends, customer preferences and other useful business information.

These analysis results help organizations implement more effective marketing, find new revenue opportunities and offer better customer service. They also improve operational efficiency and provide a competitive advantage over rival organizations, among other business benefits.

(Figure: The advantages of big data analysis)

So, let's move ahead and look at the problems with the traditional approaches to seizing these big data opportunities.

Second, the problems of traditional methods.

In traditional methods, the main problem is handling the heterogeneity of data, that is, structured, semi-structured and unstructured data. RDBMS mainly focuses on structured data such as bank transactions and operational data, while Hadoop mainly focuses on semi-structured and unstructured data such as text, video, audio, Facebook posts and logs. RDBMS technology is a proven, highly consistent and mature system supported by many companies. Big data, on the other hand, consists mostly of unstructured data in many different formats, and this is exactly where Hadoop is needed.

Now let's take a look at the main problems associated with big data, so that, going forward, we can understand how Hadoop emerged as a solution.

(Figure: The big data problems)

The first problem is to store a large amount of data.

Such a huge amount of data cannot be stored in a traditional system. The reason is obvious: storage is limited to a single system, while the data is growing at an alarming rate.

The second problem is to store heterogeneous data.

Now, we know that storage is a problem, but let me tell you that this is only part of the problem. Because we have discussed that data is not only huge, but also exists in unstructured, semi-structured and structured formats. Therefore, you need to ensure that there is a system to store all these types of data generated from various sources.

The third problem is the speed of access and processing.

The capacity of hard disks is increasing, but disk transfer or access speed is not increasing at a comparable rate. Let me explain with an example: if you have only one 100 MB/s I/O channel and you are processing 1 TB of data, it will take about 2.91 hours. Now, if you have four machines, each with its own I/O channel, the same amount of data takes about 43 minutes. So access and processing speed is an even bigger problem than storing big data.
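As a quick sanity check on those numbers, here is the arithmetic as a small Python snippet (assuming 1 TB = 1024 x 1024 MB and a sustained 100 MB/s per I/O channel, as in the example above):

```python
# Time to read 1 TB through one 100 MB/s I/O channel, and through four of them.
TB_IN_MB = 1024 * 1024      # 1 TB expressed in megabytes (assumption: binary units)
CHANNEL_MB_PER_S = 100      # throughput of a single I/O channel

one_channel_hours = TB_IN_MB / CHANNEL_MB_PER_S / 3600
four_channel_minutes = TB_IN_MB / (4 * CHANNEL_MB_PER_S) / 60

print(f"One machine:   {one_channel_hours:.2f} hours")      # ~2.91 hours
print(f"Four machines: {four_channel_minutes:.1f} minutes")  # ~43.7 minutes
```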

Before understanding what Hadoop is, let's first look at how Hadoop evolved over time.

Development of Hadoop

In 2003, Doug Cutting started the Nutch project, with the aim of handling billions of searches and indexing millions of web pages. In October 2003, Google published the GFS (Google File System) paper. In December 2004, Google published the MapReduce paper. In 2005, Nutch began to operate on GFS and MapReduce. In 2006, Yahoo, working with Doug Cutting and his team, created Hadoop based on GFS and MapReduce. You might be surprised to learn that in 2007, Yahoo started using Hadoop on a 1,000-node cluster.

In 2008, Yahoo released Hadoop as an open-source project to the Apache Software Foundation. In July 2008, Apache successfully tested a 4,000-node cluster with Hadoop. In 2009, Hadoop successfully sorted a petabyte of data in less than 17 hours, handling billions of searches and indexing millions of web pages. In December 2011, Apache Hadoop released version 1.0. In late August 2013, version 2.0.6 was released.

While discussing these problems, we saw that a distributed system can be the solution, and Hadoop provides exactly that. Now, let's learn what Hadoop is.

Third, what is Hadoop?

Hadoop is a framework that allows you to store big data in a distributed environment so that it can be processed in parallel. There are basically two components in Hadoop:

(Figure: The Hadoop framework)

The first is HDFS (Hadoop Distributed File System), used for storage, which lets you store data of various formats across a cluster. The second is YARN, used for resource management in Hadoop. It allows the data stored across HDFS to be processed in parallel.

Let's take a look at HDFS first.

HDFS

HDFS creates a layer of abstraction. Let me simplify it for you: similar to virtualization, you can logically see HDFS as a single unit for storing big data, but in reality you are storing the data across multiple nodes in a distributed fashion. HDFS follows a master-slave architecture.

(Figure: HDFS)

In HDFS, the NameNode is the master node and the DataNodes are the slave nodes. The NameNode contains metadata about the data stored in the DataNodes, such as which data block is stored on which DataNode and where the replicas of each block are kept. The actual data is stored on the DataNodes.

I would also like to add that we actually replicate the data blocks present on the DataNodes, and the default replication factor is 3. Since we use commodity hardware, and we know that the failure rate of such hardware is quite high, if one of the DataNodes fails, HDFS will still have copies of the lost data blocks. You can also configure the replication factor as needed. You can read the HDFS guide to learn more about HDFS.
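To make the replication idea concrete, here is a minimal, purely illustrative Python sketch (a toy model, not HDFS's actual placement policy; the node and block names are made up). It stores each block on three DataNodes and shows that every block survives the failure of one node:

```python
import itertools

DATANODES = ["dn1", "dn2", "dn3", "dn4"]   # hypothetical slave nodes
REPLICATION_FACTOR = 3                     # HDFS default

def place_blocks(blocks, nodes, replication):
    """Build NameNode-style metadata: which DataNodes hold each block."""
    placement = {}
    start_positions = itertools.cycle(range(len(nodes)))
    for block in blocks:
        start = next(start_positions)
        placement[block] = [nodes[(start + i) % len(nodes)] for i in range(replication)]
    return placement

metadata = place_blocks(["blk_1", "blk_2", "blk_3"], DATANODES, REPLICATION_FACTOR)

# Simulate the failure of one DataNode: every block still has live replicas.
failed_node = "dn2"
for block, replicas in metadata.items():
    survivors = [node for node in replicas if node != failed_node]
    print(f"{block}: stored on {replicas}, still readable from {survivors}")
```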

Fourth, Hadoop as the solution.

Let's take a look at how Hadoop provides a solution to the big data problem just discussed.

(Figure: Hadoop as the solution)

The first problem is to store big data.

HDFS provides a distributed way to store big data. Your data is stored in blocks across the DataNodes, and you can specify the block size. Basically, if you have 512 MB of data and HDFS is configured with a 128 MB block size, HDFS divides the data into four blocks (512/128 = 4), stores them on different DataNodes, and also replicates the blocks on other DataNodes. Now, since we use commodity hardware, storage is no longer a problem.

It also solves the scaling problem. Hadoop focuses on horizontal scaling rather than vertical scaling. Instead of adding more resources to a single DataNode, you can always add extra DataNodes to the HDFS cluster as needed. Let me summarize it for you: to store 1 TB of data, you do not need a 1 TB system. You can instead do it on several systems with 128 GB of storage or less.
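The block and scaling arithmetic above is easy to verify. Here is a small sketch, assuming the 128 MB block size and the example sizes from the text:

```python
import math

# First problem: how many blocks does a 512 MB file become with 128 MB blocks?
DATA_MB = 512
BLOCK_MB = 128
print(f"{DATA_MB} MB -> {math.ceil(DATA_MB / BLOCK_MB)} blocks of {BLOCK_MB} MB")  # 4 blocks

# Horizontal scaling: 1 TB of data does not need a single 1 TB machine.
DATA_GB = 1024        # 1 TB expressed in GB
NODE_GB = 128         # storage available on one commodity node
print(f"1 TB fits across {math.ceil(DATA_GB / NODE_GB)} nodes of {NODE_GB} GB")    # 8 nodes
# Note: with the default replication factor of 3, you would provision roughly 3x that capacity.
```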

The next problem is to store all kinds of data.

With HDFS, you can store all kinds of data, whether structured, semi-structured or unstructured, because HDFS performs no schema validation before the data is dumped. It also follows the write-once, read-many model. So you only need to write the data once, and you can read it many times to find insights.

The third challenge is to access and process the data faster.

Yes, this is one of the main challenges of big data. To solve it, we move the processing to the data instead of moving the data to the processing. What does this mean? Instead of moving the data to the master node and processing it there, in MapReduce the processing logic is sent to the various slave nodes, and the data is processed in parallel across them. The processed results are then sent to the master node, where they are merged and the response is sent back to the client.
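Here is a deliberately simplified Python sketch of that idea (it mimics the flow, not Hadoop's real MapReduce API): each "slave node" counts words in its own local block, and only the small intermediate counts travel back to the "master" to be merged:

```python
from collections import Counter
from multiprocessing import Pool

# Each "slave node" holds its own block of the data locally.
BLOCKS_ON_NODES = [
    "big data is growing fast",
    "hadoop stores big data",
    "hadoop processes data in parallel",
]

def map_phase(block):
    """Runs where the data lives: count words in the local block only."""
    return Counter(block.split())

def reduce_phase(partial_counts):
    """Runs on the master: merge the small intermediate results."""
    merged = Counter()
    for counts in partial_counts:
        merged.update(counts)
    return merged

if __name__ == "__main__":
    with Pool(processes=len(BLOCKS_ON_NODES)) as pool:
        partials = pool.map(map_phase, BLOCKS_ON_NODES)  # processing is sent to the data
    print(reduce_phase(partials))                        # only the results are moved
```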

In the YARN architecture, we have the ResourceManager and the NodeManager. The ResourceManager may or may not be configured on the same machine as the NameNode, but the NodeManager should be configured on the same machine as the DataNode.

YARN performs all processing activities by allocating resources and scheduling tasks.

(Figure: YARN)

It has two main components, namely the ResourceManager and the NodeManager.

The ResourceManager is again the master node. It receives the processing requests and passes the relevant parts of each request to the corresponding NodeManagers, where the actual processing takes place. The NodeManager is installed on every DataNode and is responsible for executing the tasks on that DataNode.
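The flow described above can be sketched as follows. This is a conceptual toy model in Python, not the real YARN API; the host names and block locations are made up. The ResourceManager hands each piece of work to the NodeManager running on the DataNode that already holds the corresponding block:

```python
# NameNode-style metadata (hypothetical): which DataNode holds which block.
BLOCK_LOCATIONS = {"blk_1": "datanode-1", "blk_2": "datanode-2", "blk_3": "datanode-3"}

class NodeManager:
    """Runs on every DataNode and executes tasks on that node."""
    def __init__(self, host):
        self.host = host

    def run_task(self, block):
        return f"{self.host} processed {block} locally"

class ResourceManager:
    """Master: receives a processing request and dispatches it to NodeManagers."""
    def __init__(self, node_managers):
        self.node_managers = node_managers

    def submit(self, blocks):
        # Prefer data locality: send each task to the node that stores the block.
        return [self.node_managers[BLOCK_LOCATIONS[b]].run_task(b) for b in blocks]

rm = ResourceManager({host: NodeManager(host) for host in BLOCK_LOCATIONS.values()})
for result in rm.submit(["blk_1", "blk_2", "blk_3"]):
    print(result)
```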

I hope you now have a good idea of what Hadoop is and of its main components. Let's move on and learn when to use Hadoop and when not to use it.

When to use Hadoop?

Hadoop is used for:

1. Search: Yahoo, Amazon, Zvents

2. Log processing: Facebook, Yahoo

3. Data warehouse: Facebook, AOL

4. Video and image analysis: New York Times

So far, we have seen how Hadoop makes big data processing possible. However, in some cases, Hadoop is not recommended.