Overview of multivariate statistical analysis

I will add links to the study notes in each chapter later.

Multivariate statistical analysis is the branch of statistics that studies the interdependence among multiple random variables and the statistical laws governing them.

In basic statistics, the analysis usually considers the influence of one or several factors on a single observed index (variable). This is called univariate statistical analysis.

When the analysis considers the influence of one or several factors on two or more observed indexes (variables), or the interdependence among multiple observed indexes, it is called multivariate statistical analysis.

The problems it addresses fall into two broad categories:

(1) Classifying the data and finding the relationships and internal laws among the classes. Cluster analysis and discriminant analysis are usually used to construct such classification models.

(2) Simplifying the system structure. The goal is to find the best subset of variables among many factors, describe the multivariate system and the influence of each factor on it using the information contained in that subset, and discard secondary factors so that the core of the system becomes clear. Principal component analysis, factor analysis and correspondence analysis can be used for this.

The main topics of multivariate statistical analysis are: graphical methods for multivariate data, multivariate linear correlation and regression analysis, discriminant analysis, cluster analysis, principal component analysis, factor analysis, correspondence analysis and canonical correlation analysis.

Multivariate data are data with multiple variables. If each variable is regarded as a random vector of observations, the data set formed by several variables is a random matrix, so the basic form of multivariate data is a matrix. Representing these data matrices mathematically is our main task. In other words, the basic operations on multivariate data are matrix operations, and R is an excellent language for matrix computation, which is a major advantage in applying it here.
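As a minimal sketch of the matrix view described above, the following Python snippet stores a small data set as an n-by-p matrix (rows are observations, columns are variables) and computes the sample mean vector and the sample covariance matrix. The data values and the helper names `column_means` and `covariance_matrix` are illustrative assumptions, and plain Python is used only to keep the example self-contained.

```python
# Hypothetical data matrix: 5 observations (rows) on 3 variables (columns).
X = [
    [2.0, 4.0, 1.0],
    [3.0, 6.0, 0.0],
    [4.0, 8.0, 2.0],
    [5.0, 10.0, 1.0],
    [6.0, 12.0, 1.0],
]

def column_means(X):
    """Sample mean of each column (variable)."""
    n = len(X)
    return [sum(row[j] for row in X) / n for j in range(len(X[0]))]

def covariance_matrix(X):
    """Sample covariance matrix with the usual n - 1 denominator."""
    n, p = len(X), len(X[0])
    means = column_means(X)
    centered = [[row[j] - means[j] for j in range(p)] for row in X]
    return [
        [sum(c[j] * c[k] for c in centered) / (n - 1) for k in range(p)]
        for j in range(p)
    ]

means = column_means(X)
S = covariance_matrix(X)
print(means)
for row in S:
    print(row)
```

The covariance matrix is symmetric, and its diagonal holds the sample variances of the individual variables.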

Visual analysis, that is, the graphical method, is an important auxiliary tool in data analysis. For example, a scatter plot of two variables can be used to examine how abnormal observations (outliers) affect the sample correlation coefficient, a scatter plot matrix can be used to examine the relationships among variables, and side-by-side box plots of several variables can be used to compare their basic statistics.
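The sensitivity of the sample correlation coefficient to a single abnormal observation can also be checked numerically. The sketch below uses made-up data and a hypothetical helper `pearson_r`. It computes the Pearson correlation with and without one outlier.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical, nearly linear data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r_clean = pearson_r(x, y)
# Append one abnormal observation far from the linear trend.
r_outlier = pearson_r(x + [6.0], y + [0.0])

print(round(r_clean, 3), round(r_outlier, 3))
```

One point far from the trend is enough to pull a correlation that was close to 1 down to a weak one, which is why inspecting the scatter plot first is worthwhile.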

Correlation analysis examines a large amount of numerical data, excludes the influence of accidental factors, and explores how close the correlation between phenomena is and what form it takes. In an economic system, economic variables are often related to one another, for example economic growth and fiscal revenue, or per capita income and consumption expenditure. Some of these relationships are strictly functional and can be written as mathematical expressions. Others are uncertain: a change in one variable affects the others and makes them change, and although this change is random, it still follows certain laws. Functional relationships are easy to handle. The uncertain relationships, that is, correlations, are the ones we care about.

The main object of regression analysis is the statistical relationship between objective variables. It is based on a large number of experiments and observations of objective things, and is used to find statistical laws hidden in seemingly uncertain phenomena. Regression analysis can not only reveal the influence of independent variables on dependent variables, but also predict and control them with regression equations. The main research scope of regression analysis includes:

(1) Linear regression models: the simple (one-variable) linear regression model and the multiple linear regression model.

(2) Diagnosis of regression models: checking whether the basic assumptions of the regression model are reasonable, judging the goodness of fit of the regression equation, and choosing the form of the regression function.

(3) Generalized linear models: regression involving qualitative variables, including qualitative independent variables and qualitative dependent variables.

(4) Nonlinear regression models: one-variable nonlinear regression and multiple nonlinear regression.
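For item (1), a minimal sketch of simple linear regression by ordinary least squares is given below. The data are invented, and the function names `fit_simple_ols` and `r_squared` are assumptions made for this illustration. The coefficient of determination is included as one basic measure of fit of the kind mentioned in item (2).

```python
def fit_simple_ols(x, y):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope

def r_squared(x, y, intercept, slope):
    """Coefficient of determination: share of variance explained by the fit."""
    my = sum(y) / len(y)
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1.0 - ss_res / ss_tot

# Hypothetical observations.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.0]

intercept, slope = fit_simple_ols(x, y)
r2 = r_squared(x, y, intercept, slope)
prediction = intercept + slope * 6.0  # predict at a new x value

print(intercept, slope, r2, prediction)
```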

In practical research, a random variable often changes with one or more non-random variables, and the relationship is clearly nonlinear. How to express it with a suitable model, estimate and predict from that model, and test for nonlinearity has become an important problem. In economic forecasting, multiple regression models are often used to reflect the dependence between the forecast quantity and various factors, and linear regression is the most widely used of these. However, the relationship between real-world quantities is not necessarily linear. In some cases a nonlinear regression model is more suitable, although it is harder to build. In actual production, for example, the parameters of production management objectives (such as production cost and production hours) are correlated with processing capacity, but they do not simply grow linearly as processing capacity increases. Nonlinear regression analysis is needed in such cases.
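One common way to handle a simple nonlinear relationship is to transform it into a linear one. The sketch below assumes an exponential model y = a * exp(b * x), which is a choice made for this illustration and not a model named in the text, and fits it by least squares on log(y). The function name `fit_exponential` and the data are hypothetical. Fitting on the log scale minimizes errors in log(y), so it is not the same as nonlinear least squares on the original scale.

```python
import math

def fit_exponential(x, y):
    """Fit y = a * exp(b * x) by least squares on log(y). Requires y > 0."""
    log_y = [math.log(v) for v in y]
    n = len(x)
    mx, ml = sum(x) / n, sum(log_y) / n
    sxx = sum((u - mx) ** 2 for u in x)
    sxl = sum((u - mx) * (v - ml) for u, v in zip(x, log_y))
    b = sxl / sxx
    a = math.exp(ml - b * mx)
    return a, b

# Hypothetical noise-free data generated from a = 2.0, b = 0.5.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0 * math.exp(0.5 * xi) for xi in x]

a, b = fit_exponential(x, y)
print(a, b)
```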

Because statistical models are diverse and adaptable, they can be divided into many types according to the nature of the values taken by the dependent and explanatory variables. Linear models with qualitative independent variables are usually called general linear models; examples are experimental design models and analysis of variance (ANOVA) models. Linear models whose dependent variable is not normally distributed are called generalized linear models; examples are the logistic regression model, the log-linear model and the Cox proportional hazards model.

In 1972, Nelder (together with Wedderburn) further extended the classical linear regression model and established a unified theoretical and computational framework, which had an important impact on the application of regression models in statistics. This new class of regression models is called the generalized linear model (GLM).

The generalized linear model is a generalization of the multiple linear regression model, and from another angle it can also be regarded as a special case of the nonlinear model. Generalized linear models share some common properties that other nonlinear models lack. The generalized linear model differs from the classical linear model in that its random error need not be normally distributed. Its biggest difference from the general nonlinear model is that the nonlinear model makes no explicit assumption about the distribution of the random error, whereas in the generalized linear model that distribution is specified. The response in a generalized linear model may be discrete or continuous. The normal distribution also belongs to the exponential family: it contains a parameter describing dispersion and therefore belongs to the two-parameter exponential family.
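Logistic regression is the most familiar generalized linear model: the response is binary and the logit link relates its mean to a linear predictor. The following is a minimal sketch that fits one predictor plus an intercept by Newton-Raphson iterations (equivalent to Fisher scoring for this model). The data and the function name `fit_logistic` are assumptions for illustration, and no safeguards such as step halving or separation checks are included.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(x, y, iterations=25):
    """Maximum-likelihood fit of P(y = 1) = sigmoid(b0 + b1 * x)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        p = [sigmoid(b0 + b1 * xi) for xi in x]
        # Gradient (score) of the log-likelihood.
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # 2x2 information matrix with weights p * (1 - p).
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system explicitly.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical binary outcomes that overlap, so the estimate is finite.
x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
y = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(x, y)
fitted = [sigmoid(b0 + b1 * xi) for xi in x]
print(b0, b1)
```

At the maximum-likelihood estimate the fitted probabilities sum to the number of observed successes, which gives a quick sanity check on convergence.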

Discriminant analysis is a statistical method in multivariate analysis for deciding which type a sample belongs to. Starting from samples whose classes are already known, it establishes a discriminant criterion; when a new sample arrives, the criterion decides which class the new sample should be assigned to. In other words, the purpose of discriminant analysis is to build a classification rule composed of numerical indicators from data with known classes, and then apply the rule to classify unknown samples. For example, suppose we have laboratory indexes for a group of gastritis patients and a group of healthy people and can find the differences between the two groups. These differences are expressed as a discriminant formula, which can then be used to diagnose people suspected of having gastritis from their laboratory indicators.
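A very simple discriminant rule assigns a new sample to the class whose mean vector is closest. The sketch below uses plain Euclidean distance, which is a simplification chosen for illustration; practical discriminant analysis usually accounts for the covariance structure as well (for example through Mahalanobis distance or Fisher's linear discriminant). The two laboratory indexes and all values are invented.

```python
def centroid(rows):
    """Mean vector of a list of observations."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def squared_distance(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def classify(sample, centroids):
    """Return the label of the nearest class centroid."""
    return min(centroids, key=lambda label: squared_distance(sample, centroids[label]))

# Hypothetical training data: two laboratory indexes per person.
training = {
    "healthy": [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2]],
    "gastritis": [[3.0, 4.0], [3.2, 4.2], [2.8, 3.8]],
}
centroids = {label: centroid(rows) for label, rows in training.items()}

label_a = classify([2.9, 3.7], centroids)
label_b = classify([1.1, 2.1], centroids)
print(label_a, label_b)
```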

Cluster analysis is a modern statistical method for studying how "birds of a feather flock together", that is, how similar objects group together. In the past, classification relied mainly on experience and professional knowledge and was qualitative; mathematical methods were rarely used. As a result, many classifications were subjective and arbitrary and did not reveal the essential differences and connections between things. For classification problems with many factors and many indexes in particular, accurate classification is hard to achieve qualitatively. To overcome these shortcomings, multivariate statistical analysis was gradually introduced into numerical taxonomy, and cluster analysis emerged as a branch.

Cluster analysis is a classification technique. Compared with other multivariate methods, its theory is still relatively rough and incomplete, but it has been very successful in applications. Cluster analysis, regression analysis and discriminant analysis are called the three main methods of multivariate analysis.
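To make the idea concrete, here is a minimal sketch of k-means, one widely used clustering algorithm (the text above does not single out a particular algorithm, so this choice is an assumption). The points, the initial centers and the function names are hypothetical, and the initial centers are fixed so that the run is deterministic.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def kmeans(points, centers, iterations=10):
    """Alternate between assigning points to the nearest center and
    moving each center to the mean of its assigned points."""
    centers = list(centers)
    labels = []
    for _ in range(iterations):
        labels = [
            min(range(len(centers)), key=lambda k: squared_distance(p, centers[k]))
            for p in points
        ]
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = mean_point(members)
    return labels, centers

# Hypothetical two-dimensional observations forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centers = kmeans(points, centers=[points[0], points[3]])
print(labels)
print(centers)
```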

Practical problems often involve many indexes (variables). In most cases the variables are correlated with one another, which makes the analysis more complicated. Principal component analysis (PCA) is a statistical method that uses dimensionality reduction to transform many indicators into a few comprehensive indicators. Condensing a complicated set of indexes into a small number of components makes the problem easier to analyze and explain, and helps to grasp the main contradiction and reach a sound evaluation. This is where principal component analysis applies.
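For two correlated variables, the principal components can be written in closed form from the 2x2 sample covariance matrix. The sketch below uses made-up data and shows how one component can carry most of the total variance; all variable names are assumptions for this illustration. The closed-form eigenvector used here requires a nonzero covariance.

```python
import math

# Hypothetical observations of two positively correlated variables.
x = [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1]
y = [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
xc = [u - mx for u in x]
yc = [v - my for v in y]

# Entries of the sample covariance matrix [[a, b], [b, c]].
a = sum(u * u for u in xc) / (n - 1)
c = sum(v * v for v in yc) / (n - 1)
b = sum(u * v for u, v in zip(xc, yc)) / (n - 1)

# Eigenvalues of a symmetric 2x2 matrix.
half_trace = (a + c) / 2.0
radius = math.sqrt(((a - c) / 2.0) ** 2 + b * b)
lambda1, lambda2 = half_trace + radius, half_trace - radius

# Unit eigenvector for the larger eigenvalue (valid because b != 0 here).
vx, vy = b, lambda1 - a
length = math.hypot(vx, vy)
vx, vy = vx / length, vy / length

# Scores on the first principal component and share of variance explained.
scores = [u * vx + v * vy for u, v in zip(xc, yc)]
explained = lambda1 / (lambda1 + lambda2)
print(lambda1, lambda2, explained)
```

The variance of the scores equals the larger eigenvalue, so replacing the two original variables by this single component keeps most of the information in this example.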

Factor analysis is an extension of principal component analysis. It is also a multivariate method that reduces many variables to a few comprehensive variables, but its purpose is to explain the correlation among the original variables with a small number of unobservable latent variables. Principal component analysis, by contrast, combines the original variables into a few principal components by linear combination, replacing many indicators (variables) with fewer comprehensive ones. In multivariate analysis the variables are often correlated. What causes this correlation? Is there a common factor that cannot be observed directly but drives the changes in the observable variables?

Factor analysis is the statistical method for finding these common factors. Building on principal components, it constructs a few well-defined common factors, uses them as a framework to decompose the original variables, and examines the connections and differences among the original variables. For example, consider price changes in the pastry industry. There are hundreds or even thousands of kinds of cakes, but whatever the style, the main raw materials are flour, edible oil and sugar. Flour, edible oil and sugar are therefore the common factors of many cakes, and the price changes of the various cakes are closely tied to the price changes of these three. To understand or control price changes in the cake industry, we only need to track the prices of flour, edible oil and sugar.
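The way a hidden common factor produces correlation among observed variables can be illustrated with the standard one-factor model for standardized variables: each variable equals its loading times the common factor plus an independent unique part, so the implied correlation between two different variables is the product of their loadings. The loadings below are invented, and the function name `implied_correlation` is hypothetical.

```python
# Hypothetical loadings of three standardized variables on one common factor.
loadings = [0.9, 0.8, 0.7]
# Unique variances chosen so that each variable has total variance 1.
uniqueness = [1.0 - l * l for l in loadings]

def implied_correlation(loadings, uniqueness):
    """Correlation matrix implied by a one-factor model:
    loading_i * loading_j off the diagonal, plus the unique variance on it."""
    p = len(loadings)
    return [
        [
            loadings[i] * loadings[j] + (uniqueness[i] if i == j else 0.0)
            for j in range(p)
        ]
        for i in range(p)
    ]

R = implied_correlation(loadings, uniqueness)
for row in R:
    print([round(v, 2) for v in row])
```

Estimating the loadings from an observed correlation matrix is the harder, inverse problem that factor analysis software solves; this sketch only shows the forward direction.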

Correspondence analysis was proposed by the French statistician J.-P. Benzécri in 1970. It is a multivariate statistical method developed on the basis of factor analysis, and it combines Q-type and R-type factor analysis. In the statistical analysis of economic management data, we often have to deal with three kinds of relationships: the relationship between samples (Q-type relationship), the relationship between variables (R-type relationship), and the relationship between samples and variables (correspondence relationship). For example, when evaluating the economic performance of enterprises in a certain industry, we should study the relationships among the performance indicators, classify the enterprises by how well they perform, and also study which enterprises are most closely related to which indicators, so as to give decision-making departments more information for guiding the production and business activities of enterprises. This requires a statistical method that analyzes, classifies and maps enterprises (samples) and indicators (variables) together so that they can be interpreted economically. The statistical method for this kind of problem is correspondence analysis.

In correlation analysis, the relationship between two single variables can be measured by the simple correlation coefficient, and the relationship between one variable and a group of several variables can be measured by the multiple correlation coefficient. Many practical problems require extending this to two groups of variables, that is, to the interdependence between two groups of random variables. Canonical correlation analysis is the method for this kind of problem. It borrows the idea of principal components to study the correlation between two groups of random variables: the correlation between the two groups is transformed into the correlation between a few pairs of comprehensive variables, where different pairs are uncorrelated with each other, which simplifies the complex correlation structure.
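The multiple correlation coefficient mentioned above can be computed directly from pairwise correlations when one variable is related to a group of two. The sketch below uses the standard two-predictor formula with invented data; the function names are assumptions. In this special case, where one group contains a single variable, the first canonical correlation coincides with the multiple correlation coefficient.

```python
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def multiple_correlation(y, x1, x2):
    """Multiple correlation of y with the pair (x1, x2).
    Requires x1 and x2 not to be perfectly correlated."""
    r1, r2, r12 = pearson_r(y, x1), pearson_r(y, x2), pearson_r(x1, x2)
    r_squared = (r1 * r1 + r2 * r2 - 2.0 * r1 * r2 * r12) / (1.0 - r12 * r12)
    return math.sqrt(r_squared)

# Hypothetical data.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0]
y = [3.1, 2.9, 7.2, 6.8, 11.1]

R = multiple_correlation(y, x1, x2)
# If y is an exact linear combination of x1 and x2, R equals 1.
R_exact = multiple_correlation([a + b for a, b in zip(x1, x2)], x1, x2)
print(R, R_exact)
```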

Canonical correlation analysis is widely used in the empirical study of economic management, because many economic phenomena are the relationships between multiple variables. For example, when studying the causes of inflation, we can take several price indexes as a set of variables and several factors that affect price changes as another set of variables, find out several pairs of main comprehensive variables through canonical correlation analysis, and combine canonical correlation coefficient with the causes of price rise and inflation to give a deeper analysis result.

Multidimensional scaling (MDS) is a multivariate data analysis method that represents the similarity or closeness between objects as positions in space. In 1958, Torgerson first formally proposed the method in his doctoral thesis. MDS is common in marketing and has been applied more and more in economic management in recent years, although there are few reports of its application in China. Through a series of techniques, multidimensional scaling enables researchers to identify the key dimensions underlying subjects' evaluations of samples. For example, in market research it is often used to determine the key dimensions on which customers evaluate products, services or companies. Other applications include comparing natural attributes (such as the taste of foods or different smells), understanding perceptions of political candidates or events, and even assessing cultural differences between groups. Multidimensional scaling infers the underlying dimensions from the similarity or preference judgments that subjects provide about the samples. Once the data are available, multidimensional scaling can be used to analyze: (1) which dimensions the subjects use in evaluating the samples; (2) how many dimensions the subjects may use in a given situation; (3) the relative importance of each dimension; (4) how the samples are perceived in relation to one another.
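The core computation of classical (metric) MDS can be sketched in a few lines: square the distances, double-center them to obtain an inner-product matrix, and read coordinates off its leading eigenvectors. The example below is a simplified, one-dimensional illustration with made-up objects lying on a line, and it uses power iteration for the leading eigenvector; the function names are assumptions.

```python
import math

# Hypothetical positions of four objects on a line. Only the pairwise
# distances D are passed to the procedure.
positions = [0.0, 3.0, 7.0, 12.0]
D = [[abs(a - b) for b in positions] for a in positions]

def double_center(D):
    """Return B = -1/2 * J * D^2 * J, the inner-product matrix of the
    centered configuration."""
    n = len(D)
    sq = [[d * d for d in row] for row in D]
    row_mean = [sum(row) / n for row in sq]
    col_mean = [sum(sq[i][j] for i in range(n)) / n for j in range(n)]
    total_mean = sum(row_mean) / n
    return [
        [-0.5 * (sq[i][j] - row_mean[i] - col_mean[j] + total_mean) for j in range(n)]
        for i in range(n)
    ]

def leading_eigenpair(B, iterations=100):
    """Power iteration. The start vector is a unit basis vector because the
    all-ones vector lies in the null space of a double-centered matrix."""
    n = len(B)
    v = [1.0] + [0.0] * (n - 1)
    for _ in range(iterations):
        w = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(t * t for t in w))
        v = [t / norm for t in w]
    eigenvalue = sum(v[i] * B[i][j] * v[j] for i in range(n) for j in range(n))
    return eigenvalue, v

B = double_center(D)
eigenvalue, v = leading_eigenpair(B)
coords = [math.sqrt(eigenvalue) * t for t in v]  # one-dimensional configuration
recovered = [[abs(a - b) for b in coords] for a in coords]
print([round(c, 3) for c in coords])
```

The recovered coordinates match the original positions only up to a shift and a reflection, since distances do not determine either; what is preserved is the full set of pairwise distances.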

The 1970s and 1980s saw the vigorous development of modern scientific evaluation. Many evaluation methods emerged in this period, such as the ELECTRE method, the linear programming technique for multidimensional analysis of preference (LINMAP), the analytic hierarchy process (AHP), data envelopment analysis (DEA) and the technique for order preference by similarity to an ideal solution (TOPSIS). These methods have since been further developed and widely used.

In China, modern scientific evaluation developed mainly in the 1980s and 1990s, and great progress was made in research on evaluation methods and their applications. Comprehensive evaluation methods have been applied in many sectors of the national economy, for example in the comprehensive evaluation of sustainable development, the well-off society evaluation system, the modernization index system and the international competitiveness evaluation system.

The multi-index comprehensive evaluation method has two characteristics. First, it uses multiple indexes, each describing a different aspect of the evaluated object. Second, it ultimately evaluates the object as a whole, using a single overall index to express the object's general level.

Many comprehensive evaluation methods are in common use today, such as the comprehensive scoring method, the comprehensive index method, the rank sum ratio method, the analytic hierarchy process, the TOPSIS method, the fuzzy comprehensive evaluation method and data envelopment analysis.
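As an illustration of how several indexes are condensed into one overall index, here is a minimal sketch of the usual TOPSIS steps: normalize each criterion, apply weights, find the ideal and anti-ideal solutions, and score each alternative by its relative closeness to the ideal. The decision matrix, the weights and the function name `topsis` are assumptions made for this example, and the degenerate case where all alternatives are identical is not handled.

```python
import math

def topsis(matrix, weights, is_benefit):
    """Return a closeness score in [0, 1] for each alternative (row).
    is_benefit[j] is True if larger values of criterion j are better."""
    n_crit = len(matrix[0])
    # Vector normalization of each column, then weighting.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_crit)]
    v = [[weights[j] * row[j] / norms[j] for j in range(n_crit)] for row in matrix]
    # Ideal (best) and anti-ideal (worst) value for each criterion.
    best = [
        max(r[j] for r in v) if is_benefit[j] else min(r[j] for r in v)
        for j in range(n_crit)
    ]
    worst = [
        min(r[j] for r in v) if is_benefit[j] else max(r[j] for r in v)
        for j in range(n_crit)
    ]
    scores = []
    for r in v:
        d_best = math.sqrt(sum((r[j] - best[j]) ** 2 for j in range(n_crit)))
        d_worst = math.sqrt(sum((r[j] - worst[j]) ** 2 for j in range(n_crit)))
        scores.append(d_worst / (d_best + d_worst))
    return scores

# Hypothetical alternatives A, B, C scored on two benefit criteria and one cost.
matrix = [
    [9.0, 8.0, 2.0],  # A: best on every criterion
    [6.0, 5.0, 5.0],  # B: worst on every criterion
    [7.0, 7.0, 3.0],  # C: in between
]
weights = [0.4, 0.4, 0.2]
is_benefit = [True, True, False]

scores = topsis(matrix, weights, is_benefit)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(scores, ranking)
```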
