R is a flexible programming language designed to support exploratory data analysis, classical statistical testing, and high-end graphics. R has a rich and growing package library that sits at the cutting edge of statistics, data analysis, and data mining. R has proved a useful tool in the emerging field of big data and has been integrated into several commercial packages, such as IBM SPSS®, IBM InfoSphere®, and Mathematica.
This article presents statistician Catherine Dalzell's perspective on R.
Why choose R?
R counts. Think of it as a competitor to analytical systems such as SAS Analytics, not to mention simpler packages such as StatSoft STATISTICA or Minitab. Many professional statisticians and methodologists in government, business, and the pharmaceutical industry spend entire careers in IBM SPSS or SAS without ever writing a line of R code. So to some extent, the decision to learn and use R is a matter of corporate culture and how you prefer to work. I use a variety of tools in my statistical consulting practice, but most of my work is done in R. The following examples illustrate why I use R:
R is a powerful scripting language. I was recently asked to analyze the results of a scoping study. The researchers had examined 1,600 research papers and coded their contents against a large battery of conditions, most with multiple options and branches. Their data, once flattened onto a Microsoft® Excel® spreadsheet, contained more than 8,000 columns, most of them empty. The researchers wanted totals under various categories and headings. Messy data like this calls for the resources of a programming language, and R can process text with regular expressions much as Perl does. Although SAS and SPSS offer scripting languages for tasks that fall outside their drop-down menus, R was written as a programming language from the start, which makes it the better tool for this purpose.
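To give a rough flavor of the kind of text processing involved, here is a minimal sketch; the data frame and its column names are invented for illustration and are not from the actual study.

# Hypothetical flattened coding data: columns named like "cond1.optA", "cond2.optB", ...
codes <- data.frame(cond1.optA = c("yes", "",    "yes"),
                    cond1.optB = c("",    "yes", ""),
                    cond2.optA = c("yes", "yes", ""),
                    stringsAsFactors = FALSE)
# A regular expression picks out every column belonging to condition 1...
cond1.cols <- grep("^cond1\\.", names(codes), value = TRUE)
# ...and we count the non-empty entries under each heading.
colSums(codes[cond1.cols] != "")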
R is at the cutting edge. Many new developments in statistics appear first as R packages and only later make their way into commercial platforms. I recently obtained data from a medical study of patient recall. For each patient, we had the number of treatment items suggested by the doctor and the number of items the patient actually remembered. The natural model is the beta-binomial distribution. This has been known since the 1950s, but estimation procedures linking the model to covariates of interest are recent. Data like these are usually handled with generalized estimating equations (GEE), but GEE methods are asymptotic and assume large samples. I wanted a generalized linear model with beta-binomial error. A recent R package estimates exactly that model: betabinom, written by Ben Bolker. SPSS does not.
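For flavor, here is a minimal sketch of fitting a model of this kind. It uses the betabin() function from the aod package, one of several beta-binomial implementations on CRAN; the data and variable names are invented for illustration.

library(aod)   # provides betabin() for beta-binomial regression
# Invented data: items remembered out of items suggested, plus a covariate.
recall <- data.frame(remembered = c(8, 5, 9, 3, 7, 10, 4, 6),
                     suggested  = c(10, 10, 12, 8, 9, 12, 10, 8),
                     age        = c(65, 72, 58, 80, 69, 55, 77, 63))
# Beta-binomial errors with a single overdispersion parameter (~ 1).
fit <- betabin(cbind(remembered, suggested - remembered) ~ age, ~ 1, data = recall)
summary(fit)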
Integration with document publishing. R integrates smoothly with the LaTeX document preparation system, which means that statistical output and graphics from R can be embedded directly in publication-quality documents. This is not for everyone, but if you fancy writing a book about data analysis, or simply dislike copying results into a word-processing document, the shortest and most elegant route runs through R and LaTeX.
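One concrete route is Sweave, the literate-programming tool that ships with R: a LaTeX source file carries embedded R code chunks, and running it through R produces a .tex file in which the chunks have been replaced by their output and figures. A minimal sketch of such an .Rnw file (the file name report.Rnw is arbitrary):

\documentclass{article}
\begin{document}
The michelson data contain \Sexpr{nrow(MASS::michelson)} measurements.
<<boxplot, fig=TRUE>>=
library(MASS)
with(michelson, boxplot(Speed ~ Expt))
@
\end{document}

Running Sweave("report.Rnw") inside R produces report.tex, ready for LaTeX.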
No cost. As the owner of a small business, I appreciate that R is free. Even in a larger enterprise, it is good to know that you can bring someone in on short notice and immediately sit them down at a workstation with first-class analytical software, with no worries about the budget.
What is R and what is its use?
As a programming language, R is similar to many others. Anyone who has written code will find much in R that is familiar. What sets R apart is the statistical philosophy it embodies.
Statistical Revolution and Exploratory Data Analysis
The 140-character explanation: R is an open source implementation of S, and a programming environment for data analysis and graphics.
Computers have always been good at computing — once you have written and debugged a program that executes the algorithm you want. But in the 1960s and 1970s, computers were poor at displaying information, especially graphics. These technological limitations, combined with the state of statistical theory, meant that statistical practice and the training of statisticians focused on model building and hypothesis testing. One imagined a world in which researchers posit hypotheses (often agricultural), conduct carefully designed experiments (at an agricultural station), fit the model, and run the tests. Spreadsheet-based, menu-driven programs such as SPSS reflect this approach. Indeed, the first versions of SPSS and SAS Analytics consisted of subroutines that could be called from a (Fortran or other) program to fit and test a model from the model toolbox.
Into this world of rigor and theory, John Tukey dropped the concept of exploratory data analysis (EDA) like a pebble through a glass roof. Today it is hard to imagine analyzing a data set without checking skewness and outliers with a boxplot, or checking the normality of linear-model residuals with a quantile plot. These ideas are Tukey's, and they are now taught in any introductory statistics course. But it was not always so.
EDA is less a theory than a method, a method inseparable from rules of thumb like the following (a short R sketch after the list illustrates them):
Wherever possible, use graphics to identify features of interest.
Analysis is incremental: fit one model and, based on the results, fit another.
Check model assumptions with graphs. Label outliers.
Use robust methods to guard against violations of distributional assumptions.
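A minimal sketch of these habits in R, using the built-in cars data set:

data(cars)                            # built-in data set: speed and stopping distance
boxplot(cars$dist)                    # graphics first: check skewness and outliers
fit <- lm(dist ~ speed, data = cars)  # fit a first model
qqnorm(residuals(fit))                # quantile plot of the residuals...
qqline(residuals(fit))                # ...with a reference line for normality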
Tukey's approach triggered a wave of new graphical methods and robust estimators. It also stimulated the development of software frameworks better suited to exploratory work.
The S language was developed by John Chambers and his colleagues at Bell Laboratories as a platform for statistical analysis, particularly analyses of the Tukey sort. The first version, for internal use at Bell Labs, appeared in 1976, but it was not until 1988 that the language took something like its present form, by which time it was also available to users outside Bell Labs. Every aspect of the language fits the "new model" of data analysis:
S is an interpreted language running in a programming environment. Its syntax is much like C's, but with the hard parts left out. S takes care of memory management and variable declarations, for example, so the user never has to write or debug those aspects. The low programming overhead lets users perform many analyses on the same data set in quick succession.
High-level graphics were part of S from the beginning, and functions can annotate any open graphics window. You can easily highlight points of interest, query their values, smooth a scatterplot, and so on.
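The same interactive style survives in today's R; a small sketch, again with the built-in cars data:

plot(cars$speed, cars$dist)             # a scatterplot in a fresh graphics window
lines(lowess(cars$speed, cars$dist))    # add a smoother on top of the open plot
# identify(cars$speed, cars$dist)       # click on points to query their values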
Object orientation was added to S in 1992. In a programming language, objects structure data and functions to match the user's intuition, and human thinking is always object-oriented, statistical reasoning especially so. Statisticians work with frequency tables, time series, matrices, spreadsheets of mixed data types, models, and more. In each case, the raw data carries attributes and expectations: a time series, for example, consists of observations plus time points. And for each data type, certain statistics and plots are the standard ones. For a time series I might plot the series and a correlogram; for a fitted model I might plot fitted values and residuals. S supports objects for all of these concepts, and you can create new object classes as needed. Objects make it simple to move from conceptualizing a problem to the code that implements it.
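In modern R, the simplest incarnation of this idea, the S3 system, is little more than a class attribute plus a naming convention. A minimal sketch (the class name mySeries is invented):

# A tiny time-series-like object: observed values plus time points.
s <- list(times = 1:5, values = c(2.1, 2.3, 2.2, 2.8, 3.0))
class(s) <- "mySeries"                  # declare the object's class
print.mySeries <- function(x, ...) {    # a print method dispatched on that class
  cat("A series of", length(x$values), "observations\n")
}
s                                       # auto-printing now calls print.mySeries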
A language with attitude: S, S-Plus, and hypothesis testing
The original S took Tukey's EDA so seriously that EDA was about all you could do in it. This was a language with attitude. For example, although S came with some useful built-in functions, it lacked some of the most obvious things you would expect statistical software to have: there was no function for a two-sample test, or indeed for any real hypothesis test. But even Tukey thought that hypothesis testing was sometimes exactly what was needed.
In 1988, Statistical Sciences of Seattle licensed S and ported an enhanced version of the language, called S-Plus, to DOS and later to Windows®. With a keen sense of what its customers wanted, Statistical Sciences added classical statistical functionality to S-Plus: analysis of variance (ANOVA), t tests, and other models. In keeping with S's object orientation, the result of any such model fit is itself an S object, and appropriate function calls yield the fitted values, the residuals, and the p-values of the hypothesis tests. A model object can even contain intermediate steps of the computation, such as the QR decomposition of the design matrix (where Q is orthogonal and R is upper triangular).
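Current R behaves the same way. A short sketch, using the built-in cars data, of pulling pieces out of a fitted-model object:

fit <- lm(dist ~ speed, data = cars)  # the fitted model is itself an object
summary(fit)                          # coefficients, p-values, and fit statistics
head(fitted(fit))                     # fitted values
head(residuals(fit))                  # residuals
str(fit$qr, max.level = 1)            # the QR decomposition kept inside the object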
There is an R package for that! And an open source community.
At about the same time S-Plus was released, Ross Ihaka and Robert Gentleman of the University of Auckland in New Zealand decided to try their hand at writing an interpreter. They chose the S language as their model. The project took shape and gained support, and they named it R.
R is an implementation of S, with additional models developed for S-Plus; in some cases, the same people were involved. R is an open source project under the GNU license, and it continues to grow, mainly through the addition of packages. An R package is a collection of data sets, R functions, documentation, and dynamically loadable C or Fortran code that can be installed as a unit and accessed from within an R session. R packages add new functionality to R, and through them researchers can easily share computational methods with their peers. Some packages are limited in scope, others cover entire statistical domains, and some hold the latest developments. Indeed, many advances in statistics appear first as R packages and only later find their way into commercial software.
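Installing a package from CRAN and loading it into a session takes one command each, for example:

install.packages("MASS")   # download and install from CRAN (done once)
library(MASS)              # attach the package to the current session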
At the time of writing, CRAN, the R download site, offers 4,701 R packages, six of which were added that very day. There is an R package for everything, or so it seems.
What happens when I use R?
Note: This article is not an R tutorial. The following example merely tries to give you a feel for an R session.
R binaries are available for Windows, Mac OS X, and several Linux® distributions. Source code is also available for those who want to compile it themselves.
On Windows®, the installer adds R to the Start menu. To start R on Linux, open a terminal window and type R at the prompt. You should see a screen like Figure 1.
Figure 1. The R workspace
Type a command at the prompt, and R responds.
At this point, in a real session, you would read data from an external file into an R object. R can read data in a wide variety of formats, but for this example I use the michelson data set from the MASS package. This package accompanies Venables and Ripley's landmark text, Modern Applied Statistics with S-Plus (see Resources). michelson contains the results of the famous Michelson and Morley experiments to measure the speed of light.
The commands in Listing 1 load the MASS package and fetch and examine the michelson data. Figure 2 shows the commands along with R's responses. Each line contains an R function with its arguments enclosed in parentheses (()).
Listing 1. Starting an R session
2+2                # R can be used as a calculator. R correctly answers 4.
library("MASS")    # Load the functions and data sets from the MASS package,
                   # which accompanies Modern Applied Statistics with S.
data(michelson)    # Copy the michelson data set into the workspace.
ls()               # List the contents of the workspace. The michelson data is there.
head(michelson)    # Show the first few rows of the data set.
                   # The Speed column contains Michelson and Morley's estimates of
                   # the speed of light, minus 299,000, in kilometers per second.
                   # Michelson and Morley ran five experiments with 20 runs each.
                   # The data set has index variables for the experiment and the run.
help(michelson)    # Call up a help screen describing the data set.
Figure 2. Starting the session and R's responses
Now let's look at the data (see Listing 2). The output appears in Figure 3.
Listing 2. A boxplot in R
# A basic boxplot.
with(michelson, boxplot(Speed ~ Expt))

# I can add colors and labels, and save the result in an object.
michelson.bp <- with(michelson, boxplot(Speed ~ Expt, xlab="Experiment", las=1,
    ylab="Speed of light minus 299,000 km/s",
    main="Michelson-Morley experiment",
    col="slateblue1"))

# On this scale, the modern estimate of the speed of light is 734.5.
# Add a horizontal line to highlight this value.
abline(h=734.5, lwd=2, col="purple")    # Add the modern speed of light.
Michelson and Morley appear to have systematically overestimated the speed of light, and the experiments seem somewhat inconsistent with one another.
Figure 3. The boxplot
Once I am happy with the analysis, I can save all the commands as an R function. See Listing 3.
Listing 3. A simple function in R
myExample <- function(){
  library(MASS)
  data(michelson)
  michelson.bw <- with(michelson, boxplot(Speed ~ Expt, xlab="Experiment", las=1,
      ylab="Speed of light minus 299,000 km/s", main="Michelson-Morley experiment",
      col="slateblue1"))
  abline(h=734.5, lwd=2, col="purple")
}
This simple example demonstrates several important features of R:
Saving results — the boxplot() function returns some useful statistics along with the chart. You can save these results into an R object with an assignment statement like michelson.bp <- … and extract them when needed. The result of any assignment remains available for the rest of the R session and can be the subject of further analysis. boxplot() returns a matrix of the statistics used to draw the plot (medians, quartiles, and so on), the number of items in each boxplot, and the outliers (shown as open circles in Figure 3). See Figure 4 and the sketch after this list.
Figure 4. Statistics returned by the boxplot function
The formula language — R (and S) has a concise language for expressing statistical models. The code Speed ~ Expt in the arguments tells the function to plot a boxplot of Speed at each level of Expt. If you wanted an analysis of variance to test whether speed differs significantly across experiments, you would use the same formula: lm(Speed ~ Expt), as the sketch after this list shows. The formula language can express a rich variety of statistical models, including crossed and nested effects and both fixed and random factors.
User-defined R functions — R is a programming language.
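To make the first two points concrete, here is a short sketch that inspects the saved boxplot object from Listing 2 and reuses the same formula for an analysis of variance:

michelson.bp$stats                         # matrix of statistics behind each boxplot
michelson.bp$n                             # number of observations per boxplot
michelson.bp$out                           # the outliers drawn as open circles
fit <- lm(Speed ~ Expt, data = michelson)  # same formula, now fitting a linear model
anova(fit)                                 # test for Speed differences across experiments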
R enters the 21st century
Tukey's exploratory data analysis has become part of the standard curriculum: we teach it, and statisticians use it. R supports this approach, which goes a long way toward explaining its continued popularity. Object orientation also helps R stay current, because new sources of data call for new data structures to analyze. InfoSphere® Streams now supports R analysis of data quite unlike anything John Chambers envisioned.
R and InfoSphere Streams
InfoSphere Streams is a computing platform and integrated development environment for analyzing high-velocity data arriving from thousands of sources. The content of these data streams is typically unstructured or semi-structured, and the purpose of the analysis is to detect changing patterns in the data and to guide decisions in response to rapidly changing events. SPL, the programming language for InfoSphere Streams, organizes data through a paradigm that reflects the dynamic nature of the data and the need for rapid analysis and response.
This is a long way from the spreadsheets and conventional flat files of classical statistical analysis, but R copes well. Starting with version 3.1, SPL applications can pass data to R and so exploit R's vast package library. InfoSphere Streams supports R by creating the appropriate R objects to receive the information carried in SPL tuples, SPL's basic data structure. InfoSphere Streams data can thus be passed to R for further analysis, and the results sent back to SPL.
Does R need high-end hardware?
I ran these examples on an Acer netbook running Crunchbang Linux: R does not demand heavy machinery for small and medium-sized analyses. For 20 years, people have said that R is slow because it is an interpreted language, and that the size of the data it can analyze is limited by the computer's memory. Both points are true, but on a modern machine they are usually irrelevant unless the application is truly enormous (big data).
Disadvantages of r
To be fair, there are things R does poorly or not at all, and R is not for every user:
R is not a data warehouse. The easiest way to enter data into R is to type it somewhere else and import it. Attempts have been made to give R a spreadsheet front end, but none has caught on. The lack of spreadsheet functionality affects not only data entry; it also makes it hard to inspect data in R as intuitively as in SPSS or Excel.
R makes ordinary tasks harder than they need to be. In medical research, for example, the first thing you do with data is compute summary statistics for every variable and list the non-responses and missing values. That takes three clicks in SPSS, but R has no built-in function to compute this very obvious information and lay it out as a table. Writing the code is easy enough (a short sketch after this list shows one way), but sometimes you just want to point and click.
R has a nontrivial learning curve. A beginner can open a menu-driven statistics platform and have results within minutes. Not everyone wants to become a programmer in order to become an analyst, and perhaps not everyone needs to.
R is open source. The R community is large, mature, and active, and R is surely among the more successful open source projects. As noted, the R implementation is more than 20 years old, and the S language is older still: this is a proven concept and a proven product. But with any open source product, reliability depends on transparency. We trust the code because we can inspect it ourselves and because others can inspect it and report errors. This differs from proprietary projects, which benchmark and validate their own software. And there is no particular reason to assume that a seldom-used R package actually produces correct results.
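As promised above, here is a short sketch of the missing-data summary; the data frame dat is a toy stand-in for your own data.

dat <- data.frame(age = c(34, NA, 51), score = c(NA, NA, 7))  # toy data
sapply(dat, function(x) sum(is.na(x)))    # count missing values per variable
summary(dat)                              # basic statistics, with NA counts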
Concluding remarks
Do I need to learn R? Perhaps not; need is a strong word. But is R a valuable tool for data analysis? Certainly. The language was designed to reflect the way statisticians think and work. R reinforces good habits and sound analysis. For me, it is the right tool for the job.