A detailed introduction to big data analysis tools & data analysis algorithms

1. Hadoop

Hadoop is a software framework capable of distributed processing of large amounts of data, and it does so in a reliable, efficient and scalable way. Hadoop is reliable because it assumes that computing elements and storage will fail, so it maintains multiple copies of working data and can redistribute processing around failed nodes. Hadoop is efficient because it works in parallel, which speeds up processing. Hadoop is also scalable and can handle petabyte-scale data. In addition, Hadoop runs on inexpensive commodity servers, so its cost is relatively low and anyone can use it.

Hadoop is a distributed computing platform that users can easily build and use. Users can easily develop and run applications that handle massive amounts of data on Hadoop. Its main advantages are:

1. High reliability. Hadoop's ability to store and process data bit by bit is trustworthy.

2. High scalability. Hadoop distributes data and completes computing tasks among available computer clusters, and can be easily extended to thousands of nodes.

3. High efficiency. Hadoop can dynamically move data between nodes to ensure the dynamic balance of each node, so the processing speed is very fast.

4. High fault tolerance. Hadoop can automatically save multiple copies of data and automatically reassign failed tasks.

Hadoop's framework is written in Java, so it is ideal for running on Linux production platforms. Applications on Hadoop can also be written in other languages such as C++.
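As a hedged illustration of the parallel model described above, here is a minimal word-count sketch for Hadoop Streaming, which lets the mapper and reducer be plain scripts that read standard input and write standard output. The script names are assumptions added for illustration, not part of the original text.

#!/usr/bin/env python3
# mapper.py - emit "word<TAB>1" for every word read from standard input
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

#!/usr/bin/env python3
# reducer.py - sum the counts per word; Hadoop delivers the mapper output sorted by key
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

A job like this is submitted with the hadoop jar command and the streaming jar shipped with the distribution, passing the two scripts as the -mapper and -reducer options.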

2. HPCC

HPCC is the abbreviation for High Performance Computing and Communications. In 1993, the Federal Coordinating Council for Science, Engineering and Technology of the United States submitted to Congress the report "Grand Challenges: High Performance Computing and Communications", also known as the HPCC plan report, i.e. the U.S. President's scientific strategy project. Its purpose is to solve a number of important scientific and technological challenges by strengthening research and development. HPCC is the U.S. plan for implementing the information superhighway, and carrying it out will cost tens of billions of dollars. Its main goals are to develop scalable computing systems and related software to support terabit-level network transmission performance, to develop gigabit network technology, and to expand the network connection capabilities of research and education institutions.

The project mainly consists of five parts:

1. High Performance Computing Systems (HPCS), including research on future generations of computer systems, system design tools, advanced typical systems and the evaluation of original systems;

2. Advanced Software Technology and Algorithms (ASTA), including software support for grand challenges, the design of new algorithms, software branches and tools, computing and high-performance computing research centers, etc.;

3. National Research and Education Network (NREN), including research and development of relay stations and gigabit-level transmission;

4. Basic Research and Human Resources (BRHR), including basic research, training, education and course materials; it aims to increase the flow of innovative ideas in scalable high-performance computing by rewarding investigators (both initial and long-term investigations), to enlarge the pool of skilled and trained personnel through better education, high-performance computing training and exchanges, and to provide the infrastructure needed to support these investigations and research activities;

5. Information Infrastructure Technology and Applications (IITA), which aims to ensure the leading position of the United States in the development of advanced information technology.

3. Storm

Storm is free, open source software: a distributed, fault-tolerant real-time computing system. Storm can process huge data streams very reliably and is used for real-time processing in the way Hadoop is used for batch data. Storm is simple, supports multiple programming languages and is very interesting to use. Storm originated at Twitter, and other well-known companies using it include Groupon, Taobao, Alipay, Alibaba, Music Element, Admaster and so on.

Storm has many application areas: real-time analysis, online machine learning, continuous computation, distributed RPC (remote procedure call, requesting services from remote programs over the network), ETL (short for extract-transform-load) and so on. Storm's processing speed is impressive: in tests, each node processed one million data tuples per second. Storm is scalable, fault-tolerant and easy to set up and operate.

4. Apache Drill

In order to help enterprise users find more effective ways to speed up Hadoop data query, Apache Software Foundation recently launched an open source project called "Drill". Apache Drill implements Google's Dremel.

According to Tomer Shiran, product manager of Hadoop manufacturer MapR Technologies, "Drill" has been operated as an Apache incubator project and will continue to be promoted to software engineers around the world.

This project will create an open source version of Google's Dremel Hadoop tool (which Google uses to speed up the Internet applications of Hadoop data analysis tools). "Drill" will help Hadoop users query massive data sets faster.

The "Drill" project is actually inspired by Google's Dremel project: this project helps Google analyze and process massive data sets, including analyzing and crawling Web documents, tracking application data installed on Android Market, analyzing spam, analyzing test results on Google's distributed construction system, and so on.

By developing "Drill" as an Apache open source project, the organization hopes to establish Drill's API and a flexible, powerful architecture, thereby helping it support a wide range of data sources, data formats and query languages.
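To make the querying model concrete, here is a minimal, hedged sketch of submitting a SQL query to a Drill instance through its REST interface. The local host, the default port 8047 and the sample file path are assumptions about a default installation, and the exact shape of the JSON response can vary between Drill versions.

# Minimal sketch: send a SQL query to a local Apache Drill instance over its REST API.
# The file path dfs.`/tmp/sample.json` is hypothetical; point it at real data.
import json
import urllib.request

query = {
    "queryType": "SQL",
    "query": "SELECT COUNT(*) AS n FROM dfs.`/tmp/sample.json`",
}
req = urllib.request.Request(
    "http://localhost:8047/query.json",
    data=json.dumps(query).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.loads(resp.read())
print(result.get("rows", result))  # response keys may differ by Drill version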

5. RapidMiner

RapidMiner is a world-leading data mining solution that makes extensive use of advanced technology. Its data mining tasks cover a wide range and include a variety of data mining techniques, and it can simplify the design and evaluation of data mining processes.

Functions and characteristics

Provides free data mining technology and libraries.

100% Java code (runs on any major operating system).

Data mining processes are simple, powerful and intuitive.

An internal XML representation ensures a standardized format for exchanging data mining processes.

Large-scale processes can be automated with a simple scripting language.

Multi-level data views ensure efficient and transparent data handling.

Graphical user interface for interactive prototyping.

Command line (batch mode) for automated large-scale applications.

Application programming interface (API).

Simple plug-in and upgrade mechanism.

Powerful visualization engine, with visual modeling of many cutting-edge high-dimensional data sets.

More than 400 data mining operators are supported.

YALE, RapidMiner's predecessor, has been successfully applied in many different application fields, including text mining, multimedia mining, feature design, data stream mining, integrated development methods and distributed data mining.

6. Pentaho BI

The Pentaho BI platform is different from traditional BI products. It is a process-centered, solution-oriented framework whose purpose is to integrate a series of enterprise BI products, open source software, APIs and other components to facilitate the development of business intelligence applications. Its appearance allows a series of independent products oriented toward business intelligence, such as JFree and Quartz, to be integrated into complex, complete business intelligence solutions.

Pentaho BI platform is the core architecture and foundation of Pentaho Open BI suite, which is process-centered, because its central controller is a workflow engine. The workflow engine uses process definitions to define business intelligence processes executed on the BI platform. You can easily customize the process and add new processes. The BI platform contains components and reports to analyze the performance of these processes. At present, Pentaho's main components include report generation, analysis, data mining and workflow management. These components are integrated into Pentaho platform through J2EE, WebService, SOAP, HTTP, Java, JavaScript, Portals and other technologies. Pentaho is mainly distributed in the form of Pentaho SDK.

The Pentaho SDK consists of five parts: the Pentaho platform, the Pentaho sample database, a standalone Pentaho platform, Pentaho solution samples and a pre-configured Pentaho web server. Among them, the Pentaho platform is the most important part and contains the main source code of the platform. The Pentaho database provides data services for the normal operation of the platform, including configuration information, solution-related information and so on; it is not strictly necessary and can be replaced by other database services through configuration. The standalone Pentaho platform is an example of the platform's independent running mode, demonstrating how to make the platform run independently without the support of an application server.

The Pentaho solution example is an Eclipse project that demonstrates how to develop related business intelligence solutions for the Pentaho platform.

Pentaho BI platform is based on servers, engines and components. These provide J2EE server, security, portal, workflow, rule engine, chart, collaboration, content management, data integration, analysis and system modeling functions. Most of these components are based on standards and can be replaced by other products.

7. SAS Enterprise Miner

A complete toolset that supports the whole data mining process.

A simple, easy-to-use graphical interface suitable for different types of users to build models quickly.

Powerful model management and evaluation functions.

A fast and convenient model publishing mechanism that helps close the business loop.

Data analysis algorithms

Big data analysis relies mainly on machine learning and large-scale computing. Machine learning includes supervised learning, unsupervised learning and reinforcement learning, and supervised learning in turn includes classification, regression, ranking and matching learning (see Figure 1). Classification is the most common machine learning application problem: spam filtering, face detection, user profiling, text sentiment analysis, web page classification and so on are in essence classification problems. Classification is also the most thoroughly studied and most widely applied branch of machine learning.

Recently, Fernández-Delgado and others published an interesting paper in JMLR (Journal of Machine Learning Research). They compared 179 different classification learning methods (classification learning algorithms) on the 121 UCI data sets (UCI is a commonly used collection of machine learning data sets, most of them not very large). The results show that random forests and SVM rank first and second, with little difference between them: on 84.3% of the data sets, random forests outperformed 90% of the other methods. In other words, in most cases a random forest or an SVM is enough to get the job done.
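The sketch below does not reproduce that benchmark; it only illustrates, assuming scikit-learn is installed, how a random forest and an SVM can be compared by cross-validation on one of the library's small built-in data sets.

# Minimal sketch: compare a random forest and an SVM on a small built-in data set.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

models = {
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "SVM (RBF kernel)": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0)),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validated accuracy
    print(f"{name}: mean accuracy {scores.mean():.3f}")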

KNN

The k-nearest-neighbor algorithm. Given some training data and a new test point, find the k training points closest to the test point and look at their class labels; whichever class holds the majority among those neighbors becomes the predicted class of the test point. The neighbors can also be given different weights, so that closer points count more and farther points count less.
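A minimal sketch of this idea, assuming scikit-learn and its small built-in iris data set; the choice of k = 5 and distance weighting is illustrative.

# Minimal sketch of k-nearest-neighbor classification with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# weights="distance" makes closer neighbors count more, as described above.
knn = KNeighborsClassifier(n_neighbors=5, weights="distance")
knn.fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))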

Naive Bayes

The naive Bayes algorithm. Naive Bayes is a relatively simple classifier in the Bayesian family; it relies on Bayes' theorem, which in essence is the conversion and derivation between conditional probabilities.

Naive Bayes classification is a very simple classification algorithm. It is called naive because the idea behind it really is simple: for an item to be classified, compute the probability of each class given that this item appears, and assign the item to the class with the largest probability. In everyday terms: if you see a Black man on the street and someone asks where he is from, nine times out of ten you will guess Africa. Why? Because people from Africa make up the largest proportion of Black people; he may of course be American or Asian, but in the absence of other information we choose the class with the greatest conditional probability. That is the idea behind naive Bayes.
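A minimal sketch of the idea, assuming scikit-learn; the tiny spam/ham corpus below is invented purely for illustration.

# Minimal sketch of naive Bayes text classification with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["win a free prize now", "meeting agenda for monday",
         "free money offer", "project status report"]
labels = ["spam", "ham", "spam", "ham"]

# CountVectorizer turns text into word counts; MultinomialNB applies Bayes' theorem to them.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["free prize meeting"]))  # picks the class with the largest posterior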

SVM

The support vector machine algorithm. A support vector machine (SVM) is a method for classifying both linear and nonlinear data. When classifying nonlinear data, a kernel function maps it into a space where it can be treated linearly. One of the key steps is finding the maximum-margin hyperplane.
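A minimal sketch assuming scikit-learn, using a standard nonlinear toy data set so the kernel has something to do; the RBF kernel and C value are illustrative choices.

# Minimal sketch of SVM classification with an RBF kernel in scikit-learn.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The RBF kernel handles the nonlinear boundary; C controls how soft the margin is.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
svm.fit(X_train, y_train)
print("test accuracy:", svm.score(X_test, y_test))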

Apriori

The Apriori algorithm is an association rule mining algorithm. It mines frequent itemsets through join and prune operations and then derives association rules from those frequent itemsets; the derived rules must meet a minimum confidence requirement.
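A minimal pure-Python sketch of the join-and-prune idea; the toy transactions and the support threshold are invented for illustration.

# Minimal sketch of Apriori frequent-itemset mining on invented toy transactions.
from itertools import combinations

transactions = [{"bread", "milk"},
                {"bread", "diapers", "beer"},
                {"milk", "diapers", "beer"},
                {"bread", "milk", "diapers", "beer"},
                {"bread", "milk", "diapers"}]
min_support = 3  # an itemset must appear in at least 3 transactions

def support(itemset):
    return sum(itemset <= t for t in transactions)

# frequent 1-itemsets
current = [frozenset([i]) for i in set().union(*transactions)
           if support(frozenset([i])) >= min_support]
frequent = list(current)

k = 2
while current:
    # join step: combine frequent (k-1)-itemsets into candidate k-itemsets
    candidates = {a | b for a in current for b in current if len(a | b) == k}
    # prune step: every (k-1)-subset must be frequent, and the support must be high enough
    current = [c for c in candidates
               if all(frozenset(s) in frequent for s in combinations(c, k - 1))
               and support(c) >= min_support]
    frequent.extend(current)
    k += 1

for itemset in frequent:
    print(set(itemset), "support =", support(itemset))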

PageRank

The web page importance/ranking algorithm. PageRank originated at Google; its core idea is to judge how good a web page is by the number of links pointing to it. If a page contains several outbound links, its PR value is divided equally among them. The PageRank algorithm can also be attacked by link spam.
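A minimal sketch of PageRank by power iteration on a tiny invented link graph; the damping factor 0.85 and the iteration count are conventional but arbitrary choices.

# Minimal sketch of PageRank computed by power iteration on a toy link graph.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}  # page -> outbound links
pages = list(graph)
n = len(pages)
damping = 0.85

pr = {p: 1.0 / n for p in pages}
for _ in range(50):
    new_pr = {}
    for p in pages:
        # each page splits its PR value equally over its outbound links
        incoming = sum(pr[q] / len(graph[q]) for q in pages if p in graph[q])
        new_pr[p] = (1 - damping) / n + damping * incoming
    pr = new_pr

for page, score in sorted(pr.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))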

Random forest

The random forest algorithm. The idea is decision trees plus bagging (bootstrap aggregation): CART (classification and regression trees) are used as the individual decision trees, and the weak classifiers given by the individual trees are combined into the final strong classifier. When constructing each decision tree, a random sample of the data and a random subset of the attributes are used, which helps avoid over-fitting.
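A minimal sketch assuming scikit-learn, with the two sources of randomness (bootstrap samples and random feature subsets) written out explicitly; the data set and parameter values are illustrative.

# Minimal sketch of a random forest with scikit-learn.
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)
forest = RandomForestClassifier(
    n_estimators=300,      # number of CART trees in the ensemble
    max_features="sqrt",   # random subset of attributes considered at each split
    bootstrap=True,        # each tree is trained on a random bootstrap sample
    random_state=0,
)
print("cv accuracy:", cross_val_score(forest, X, y, cv=5).mean())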

Artificial neural network

The word "neural network" actually comes from biology, and the correct name of the neural network we refer to should be "artificial neural networks (ANNs)".

Artificial neural networks also have preliminary adaptive and self-organizing abilities: they change their synaptic weights during learning or training to meet the requirements of the surrounding environment. The same network can serve different functions depending on how and on what it is trained. An artificial neural network is a system with learning ability that can develop knowledge beyond the original knowledge level of its designers. Its learning and training methods usually fall into two types. One is supervised learning, in which given sample criteria are used for classification or imitation. The other is unsupervised learning, in which only the learning method or certain rules are specified, while the specific learning content changes with the environment the system is in (that is, with the input signals); the system can automatically discover environmental features and regularities, which is more similar to how the human brain works.
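A minimal sketch of the supervised case, assuming scikit-learn; the hidden-layer size and iteration budget are arbitrary choices.

# Minimal sketch of supervised training of a small feed-forward neural network.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Training adjusts the connection weights so the outputs match the labelled examples.
net = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=0))
net.fit(X_train, y_train)
print("test accuracy:", net.score(X_test, y_test))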