How can Hadoop be combined with the R language for big data analysis?
R and Hadoop have each shown their power in their own fields. Many developers, looking at this from a computing perspective, ask the following two questions:

Question 1: The Hadoop family is already so powerful. Why combine it with R?
Question 2: Mahout can also do data mining and machine learning. How does it differ from R?

Let me try to answer them.

Question 1: The Hadoop family is already so powerful. Why combine it with R?

a. The strength of the Hadoop family lies in big data processing, which makes previously impossible work possible (computation over TB- and PB-scale data).
b. The strength of R lies in statistical analysis. Before Hadoop, large data sets had to be handled through sampling, hypothesis testing, and regression, and for a long time R was the exclusive tool of statisticians.
c. As points a and b show, Hadoop focuses on analysis of the full data set, while R focuses on analysis of sample data. Put together, the two technologies complement each other.
d. A simulated scenario: analyze 1 PB of visit logs from a news website and predict future traffic changes.
d1: Analyze a small amount of data with R, build a regression model for the business objective, and define the indicators.
d2: Use Hadoop to extract the indicators from the massive log data.
d3: Use the R model to test and adjust against the indicator data.

To a computer developer's way of thinking, Hadoop does everything; but without data modeling and validation, the "predicted result" is bound to be problematic. To a statistician's way of thinking, R does everything; but a "predicted result" obtained from sampling alone is also bound to be problematic.
Therefore, combining the two is the inevitable direction for the industry and a meeting point between industry and academia, and it opens up wide possibilities for people with interdisciplinary skills.

Question 2: Mahout can also do data mining and machine learning. How does it differ from R?

a. Mahout is a Hadoop-based algorithm framework for data mining and machine learning, and its focus is likewise on solving the computation problems of big data.
b. Mahout currently supports collaborative filtering, recommendation, clustering, and classification algorithms, as well as LDA, naive Bayes, and random forest. Most of these are distance-based algorithms which, once expressed as matrix decompositions, can make full use of MapReduce's parallel computing framework to complete the computation efficiently.
c. Mahout's gaps: many data mining algorithms are hard to parallelize with MapReduce. Mahout's existing models are all general-purpose, and when used directly in a project their results are only a little better than random. Secondary development on Mahout requires a deep grounding in Java and Hadoop, ideally along with basics such as linear algebra, probability and statistics, and introductory algorithms. So playing the "elephant rider" (mahout) is really not easy.
d. R also provides most of the algorithms Mahout supports (except proprietary ones), and supports a large number of algorithms that Mahout does not; its set of algorithms also grows many times faster than Mahout's. Development is simple, parameter configuration is flexible, and it runs very fast on small data sets.

Although Mahout can also do data mining and machine learning, its area of strength does not overlap with R's.
Only by drawing on each tool's strengths and choosing the right technology for the right field can we really build software that delivers on both quality and quantity.

How to combine Hadoop with R?

From the previous section we can see that Hadoop and R complement each other, but in the scenario described there, Hadoop and R each process their own data separately. Where the market has a need, products naturally appear to fill the gap.

1) RHadoop
RHadoop is a product that combines Hadoop and R. It was developed by Revolution Analytics, and its code is open source on GitHub. RHadoop contains three R packages (rmr, rhdfs, and rhbase), which correspond to MapReduce, HDFS, and HBase in the Hadoop architecture.

2) RHive
RHive is a toolkit for accessing Hive directly from R, developed by the Korean company NexR.

3) Rewriting Mahout
Reimplementing Mahout in R is another way of combining the two, and I have made some attempts at this myself.

4) Hadoop calls R
Everything above is about R calling Hadoop. Of course, we can also work in the reverse direction: open a connection channel between Java and R, and let Hadoop call R functions. This part, however, has not yet been turned into a finished product.

5) R and Hadoop in practice
The technical threshold for combining R with Hadoop is still rather high. An individual needs to master not only Linux, Java, Hadoop, and R, but also fundamentals such as software development, algorithms, probability and statistics, linear algebra, data visualization, and industry background. Deploying such an environment in a company also requires the cooperation of multiple departments and many kinds of talent.
This covers Hadoop operations and maintenance, Hadoop algorithm development, R modeling, MapReduce development in R, software development, testing, and so on. As a result, such deployments are still rare in practice.
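
The simulated scenario under Question 1 (steps d1 to d3) can be sketched roughly as follows. This is only a toy illustration, not how RHadoop itself works: plain Python stands in for both R (the regression model) and Hadoop (the map and reduce steps), and the log format, the function names, and the sample numbers are all made up for the example.

```python
from collections import defaultdict


def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x.
    Stands in for the regression model built with R in step d1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def map_log_line(line):
    """Map step: emit (day, 1) per visit. Assumed line format: '<day> <url>'."""
    day, _url = line.split(maxsplit=1)
    return int(day), 1


def reduce_counts(pairs):
    """Reduce step: sum the counts per day (the indicator extracted in step d2)."""
    totals = defaultdict(int)
    for day, count in pairs:
        totals[day] += count
    return dict(totals)


# d1: build the model on a small sample of (day, visits) observations.
sample_days = [1, 2, 3, 4]
sample_visits = [1, 2, 3, 4]
intercept, slope = fit_line(sample_days, sample_visits)

# d2: extract the daily-visits indicator from the "full" log
# (here: 5 visits on day 5 and 7 visits on day 6).
full_log = [f"{day} /news/{i}" for day, n in ((5, 5), (6, 7)) for i in range(n)]
daily_visits = reduce_counts(map_log_line(line) for line in full_log)

# d3: compare the extracted indicator with the model's predictions.
residuals = {
    day: visits - (intercept + slope * day)
    for day, visits in daily_visits.items()
}
print(daily_visits, residuals)
```

In a real deployment the map and reduce functions would run on the cluster (for example through rmr), while the model fitting and the residual check would stay in R on the much smaller indicator data.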