Regression analysis is a statistical method for determining the quantitative relationship between two or more variables, and it is very widely used. According to the number of independent variables involved, it can be divided into univariate regression analysis and multivariate regression analysis; according to the type of relationship between the independent and dependent variables, it can be divided into linear regression analysis and nonlinear regression analysis. If a regression analysis contains only one independent variable and one dependent variable, and their relationship can be approximated by a straight line, it is called unary (simple) linear regression analysis. If it includes two or more independent variables, and the relationship between the dependent variable and the independent variables is linear, it is called multiple linear regression analysis.
Definition
Regression analysis is one of the most widely used data analysis methods. Based on observed data, it establishes an appropriate dependence relationship between variables in order to analyze the internal patterns of the data, and it can be applied to problems such as forecasting and control.
Classical linear regression rests on several assumptions:
Homogeneity of variance
Linear relationship
Additivity of effects
Variables are measured without error
Variables follow a multivariate normal distribution
Observations are independent
The model is complete (no irrelevant variables are included, and no relevant variables are omitted)
The error terms are independent and follow a normal distribution with mean 0 and constant variance σ²
Real data often cannot fully satisfy these assumptions, so statisticians have developed many regression models that relax the constraints of the classical linear regression model.
Regression analysis is a statistical method for studying the relationship between one or more random variables Y1, Y2, …, Yi and other variables X1, …, Xk; when there are several dependent variables it is also called multivariate regression analysis. Generally, Y1, Y2, …, Yi are the dependent variables and X1, …, Xk are the independent variables. A regression analysis model is a mathematical model; in particular, when the relationship between the dependent and independent variables is linear, it is a special case of the linear model.

The simplest case is one independent variable and one dependent variable with an approximately linear relationship, called unary (simple) linear regression: the model is Y = a + bX + ε, where X is the independent variable, Y is the dependent variable, and ε is the random error. It is generally assumed that the random error has mean 0 and variance σ² (σ² > 0), with σ² independent of the value of X. If we further assume the random error follows a normal distribution, the model is called a normal linear model.

In general, with k independent variables and one dependent variable, the value of the dependent variable can be split into two parts: one part is due to the influence of the independent variables, expressed as a function of them, where the form of the function is known but contains unknown parameters; the other part is due to other factors and randomness, i.e., the random error. When the function is a linear function of the unknown parameters, the model is called a linear regression model; when it is a nonlinear function of the unknown parameters, it is called a nonlinear regression model. When the number of independent variables is greater than 1, it is called multiple regression; when the number of dependent variables is greater than 1, it is called multivariate regression.
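The unary linear model Y = a + bX + ε described above can be fitted by ordinary least squares. Below is a minimal sketch on synthetic data (the true values a = 2, b = 3 and the noise level are made up for illustration):

```python
import numpy as np

# Minimal sketch of unary (simple) linear regression Y = a + b*X + eps,
# fitted by ordinary least squares on synthetic data.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 3.0 * x + rng.normal(0, 1.0, size=x.size)  # true a=2, b=3

# Design matrix with an intercept column; lstsq returns the OLS estimates.
X = np.column_stack([np.ones_like(x), x])
(a_hat, b_hat), *_ = np.linalg.lstsq(X, y, rcond=None)

print(round(a_hat, 2), round(b_hat, 2))
```

With only mild noise, the estimates land close to the true intercept and slope.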
The main contents of regression analysis are:
① Based on a set of data, determine the quantitative relationship between certain variables, i.e., establish a mathematical model and estimate its unknown parameters. The most commonly used estimation method is the method of least squares.
② Test the credibility of these relationships.
③ In a relationship between multiple independent variables and one dependent variable, judge which independent variables have a significant influence and which do not, keeping the significant ones in the model and eliminating the non-significant ones; typical techniques include stepwise, forward, and backward regression.
④ Use the estimated relationship to predict or control a production process. Regression analysis is very widely applied, and statistical software packages make the computation of the various regression methods very convenient.
In regression analysis, variables fall into two categories. One is the dependent variable, usually an index of interest in the practical problem, denoted by Y; the other, which affects the value of the dependent variable, is called the independent variable and is denoted by X.
The main problems of regression analysis research are:
(1) Determine the expression of the quantitative relationship between Y and X, called the regression equation;
(2) Test the reliability of the regression equation;
(3) Judge whether the independent variable X has an influence on the dependent variable Y;
(4) Use the obtained regression equation for prediction and control.
It is no exaggeration to say that regression analysis is the richest and most widely applied branch of statistics. Even the simplest t-test and analysis of variance can be viewed as special cases of linear regression, and the chi-square test can often be replaced by logistic regression.
Regression goes by many names, such as linear regression, logistic regression, Cox regression, Poisson regression, probit regression and so on, which can easily make your head spin. To give everyone a clear picture of the many kinds of regression, here is a brief summary:
1. Linear regression. This is the first regression we encounter when studying statistics. Even if you know nothing else, you should at least know that the dependent variable of linear regression is a continuous variable, while the independent variables can be continuous or categorical. If there is only one independent variable and it has only two categories, the regression is equivalent to a t-test. If there is only one independent variable with three or more categories, it is equivalent to analysis of variance. If there are two independent variables, one continuous and one categorical, it is equivalent to analysis of covariance. Remember the key requirement of linear regression: the dependent variable must be continuous.
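The claimed equivalence between a two-group t-test and linear regression on a 0/1 dummy variable can be checked numerically: the fitted slope equals the difference of the two group means. A sketch on synthetic data:

```python
import numpy as np

# Two-group comparison (the t-test setting) expressed as linear regression
# on a 0/1 dummy variable: the slope equals the difference of group means.
rng = np.random.default_rng(1)
g0 = rng.normal(5.0, 1.0, 30)   # group 0
g1 = rng.normal(6.5, 1.0, 30)   # group 1

y = np.concatenate([g0, g1])
d = np.concatenate([np.zeros(30), np.ones(30)])  # dummy: 0 or 1

X = np.column_stack([np.ones_like(d), d])
(intercept, slope), *_ = np.linalg.lstsq(X, y, rcond=None)

# Intercept = mean of group 0; slope = mean(g1) - mean(g0).
print(np.isclose(intercept, g0.mean()), np.isclose(slope, g1.mean() - g0.mean()))
```

This identity holds exactly (up to floating-point error), whatever the data.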
2. Logistic regression. Together with linear regression it forms the two major regressions, and its range of application is no smaller; if anything it is still gaining ground, because logistic regression is so simple and practical. Its results can be stated directly, e.g. "in the presence of a certain risk factor, the risk of disease increases 2.3-fold", which is easy to understand; by comparison, the coefficients of linear regression carry less direct practical meaning. Logistic regression is the mirror image of linear regression: the dependent variable must be categorical, not continuous. The categorical variable can be binary or multi-class, and multi-class outcomes can be ordered or unordered. Binary logistic regression is sometimes divided, according to the study design, into conditional and unconditional logistic regression: conditional logistic regression is used for matched (paired) data, while unconditional logistic regression is used for unmatched data, i.e., data from direct random sampling. Unordered multi-class logistic regression is sometimes called the multinomial logit model, and ordered logistic regression is sometimes called the cumulative odds (proportional odds) logit model.
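A minimal sketch of binary logistic regression, fitted here by plain gradient ascent on the log-likelihood rather than a statistics package; the data, the true coefficients, and the step size are all made up for illustration. The exponentiated slope is the odds ratio behind statements like "risk increases 2.3-fold":

```python
import numpy as np

# Binary logistic regression fitted by fixed-step gradient ascent on the
# log-likelihood (synthetic data; a real analysis would use a stats package).
rng = np.random.default_rng(2)
n = 2000
x = rng.normal(size=n)                     # one risk factor
true_beta = np.array([-1.0, 0.8])          # intercept, slope (assumed)
p = 1 / (1 + np.exp(-(true_beta[0] + true_beta[1] * x)))
y = rng.binomial(1, p)                     # binary outcome: ill / not ill

X = np.column_stack([np.ones(n), x])
beta = np.zeros(2)
for _ in range(5000):                      # gradient ascent iterations
    mu = 1 / (1 + np.exp(-X @ beta))       # current fitted probabilities
    beta += 0.001 * X.T @ (y - mu)         # score function step

odds_ratio = np.exp(beta[1])               # effect of a 1-unit increase in x
print(round(beta[1], 2), round(odds_ratio, 2))
```

The recovered slope sits near the true 0.8, and exp(slope) gives the multiplicative change in the odds of the outcome.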
3. Cox regression. The dependent variable of Cox regression is somewhat special: there must be two outcome variables at the same time, one representing the status and one representing the time, and the latter should be continuous. Only when both variables are available can Cox regression be applied. Cox regression is mainly used for the analysis of survival data. It requires at least two outcome variables: first, the event status, is the subject alive or dead? Second, the time: if death occurred, when did it occur? If the subject is alive, how long were they followed from start to finish? With these two variables, Cox regression analysis can be considered.
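The two variables Cox regression needs (a follow-up time and an event status) and its partial log-likelihood can be sketched as below. Everything here is synthetic and simplified (one covariate, no tied event times); it only illustrates that the true effect size fits the data better than "no effect":

```python
import numpy as np

# Survival data: follow-up time plus event status (1 = event, 0 = censored),
# and the Cox partial log-likelihood for a single covariate.
rng = np.random.default_rng(3)
n = 200
x = rng.normal(size=n)
true_beta = 1.0
# Exponential survival times with hazard proportional to exp(beta * x).
t = rng.exponential(1.0 / np.exp(true_beta * x))
event = (t < 2.0).astype(int)     # follow-up ends at time 2.0 (censoring)
time = np.minimum(t, 2.0)

def partial_loglik(beta):
    ll = 0.0
    for i in np.flatnonzero(event):               # each observed event
        at_risk = time >= time[i]                 # risk set at that time
        ll += beta * x[i] - np.log(np.exp(beta * x[at_risk]).sum())
    return ll

# The true effect size should fit the data better than "no effect".
print(partial_loglik(1.0) > partial_loglik(0.0))
```

A real Cox fit maximizes this partial likelihood over beta; the point here is only the data structure and the role of the risk set.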
4. Poisson regression. Poisson regression is not as widely used as the first three, but in fact, wherever logistic regression can be used, Poisson regression can usually be used as well. The dependent variable of Poisson regression is a count: how many people fell ill after a period of observation, how many people died, and so on. It is therefore similar to logistic regression, because logistic regression's outcome is morbidity or mortality, which also requires counting how many fell ill or died; if you think about it, that is essentially the same thing. It is just that Poisson regression is not as famous as logistic regression, so fewer people use it. But do not conclude that Poisson regression is useless.
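A minimal sketch of Poisson regression for count outcomes, fitted by Newton's method on the Poisson log-likelihood with a log link; the data and true coefficients are synthetic:

```python
import numpy as np

# Poisson regression (log link) for a count outcome such as number of cases,
# fitted by Newton-Raphson on synthetic data.
rng = np.random.default_rng(4)
n = 1000
x = rng.normal(size=n)
true_beta = np.array([0.5, 0.7])               # intercept, slope (assumed)
counts = rng.poisson(np.exp(true_beta[0] + true_beta[1] * x))

X = np.column_stack([np.ones(n), x])
beta = np.zeros(2)
for _ in range(25):                            # Newton-Raphson iterations
    mu = np.exp(X @ beta)                      # expected counts
    grad = X.T @ (counts - mu)                 # score vector
    hess = X.T @ (X * mu[:, None])             # Fisher information
    beta += np.linalg.solve(hess, grad)

print(np.round(beta, 2))
```

The estimated coefficients recover the true values closely, and exp(slope) is interpreted as a rate ratio, much like the odds ratio of logistic regression.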
5. Probit regression. Probit regression is rarely used in medicine. Part of the problem is that the word "probit" is hard to grasp; it is usually translated as "probability unit". The probit function is actually very close to the logistic function, and the two give very similar analysis results. Unfortunately, the practical interpretation of probit regression is not as intuitive as that of logistic regression, which keeps it obscure, although it seems to be used more in the field of sociology.
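The closeness of the probit and logistic functions can be shown numerically. A common rule of thumb rescales the two by a factor of roughly 1.6 to 1.7; the sketch below, using only the standard library, measures the largest gap between the normal CDF and a rescaled logistic curve:

```python
import math

# The probit link is the standard normal CDF; after rescaling by ~1.7 it is
# very close to the logistic function, which is why the two regressions give
# similar results.
def probit_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logistic_cdf(z):
    return 1.0 / (1.0 + math.exp(-z))

# Compare Phi(z) with the logistic curve evaluated at 1.7 * z.
max_gap = max(abs(probit_cdf(z) - logistic_cdf(1.7 * z))
              for z in [i / 10.0 for i in range(-30, 31)])
print(round(max_gap, 3))
```

The maximum discrepancy over this range stays around one percentage point, which is why the two analyses rarely disagree in practice.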
6. Negative binomial regression. The "negative binomial" refers to a distribution, and the situation is analogous to Poisson and logistic regression: Poisson regression is used for data following a Poisson distribution, logistic regression for data following a binomial distribution, and negative binomial regression for data following a negative binomial distribution. Talk of distributions makes everyone's head hurt, so put simply: binomially distributed data can be thought of as binary outcomes, and Poisson-distributed data as counts, i.e., whole numbers. Unlike height, which may have decimals, counts cannot. Negative binomial data are also counts, but with a stricter condition than Poisson: if your outcome is a count and the outcomes may be clustered, it may follow a negative binomial distribution.

A simple example: suppose you investigate the risk factors for influenza; the outcome is of course the number of influenza cases. If some of the people surveyed belong to the same family, then because flu is contagious, once one family member catches it the others may too. That is clustering. Although the outcome is a count, Poisson regression is not necessarily appropriate because of this clustering, and negative binomial regression can be considered.

Since this example has come up: data suited to logistic regression can usually also be handled by Poisson regression. In the case above, we can split the outcome into two categories: each person either has the flu or does not, a binary outcome, so logistic regression can be used. But what if the data are clustered? Fortunately, logistic regression has extensions: you can use a multilevel logistic regression model, or consider generalized estimating equations. Both methods can handle binary dependent variables with hierarchical or repeated-measures data.
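The practical fingerprint of clustering is overdispersion: a Poisson outcome has variance equal to its mean, while negative binomial counts have variance larger than the mean. A small simulation (all numbers made up) illustrates the difference:

```python
import numpy as np

# Overdispersion check: Poisson counts have var == mean, while negative
# binomial counts (the clustered case, e.g. flu within families) have
# var > mean. Parameters chosen so both have mean 3.
rng = np.random.default_rng(5)
pois = rng.poisson(3.0, 20000)                       # var == mean == 3
negbin = rng.negative_binomial(2, 2.0 / 5.0, 20000)  # mean 3, var 7.5

print(round(pois.mean(), 1), round(pois.var(), 1))
print(round(negbin.mean(), 1), round(negbin.var(), 1))
```

Seeing the sample variance far exceed the sample mean is a standard hint that plain Poisson regression is inappropriate and negative binomial regression should be considered.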
7. Weibull regression (the Chinese transliteration varies). Perhaps you have never even heard of Weibull regression; in fact the name is less fearsome than it sounds. As mentioned above, Cox regression is the commonly used method in survival data analysis and almost dominates the field. Yet several other methods survive tenaciously in the cracks, although few people in China are willing to use them, and Weibull regression is one of them. Why is Cox regression so popular? Because it is simple: it can be used almost without checking conditions (apart from the proportional hazards condition), so it fits most survival data. Weibull regression is conditional: the data must follow a Weibull distribution. What, a distribution again?! I can guess everyone's head is swelling once more, and you may be tempted to skip ahead and just use Cox regression, but I suggest you read on. Why? You are familiar with parametric versus nonparametric tests, and probably prefer a parametric test such as the t-test to a nonparametric one such as the rank-sum test. Weibull regression and Cox regression correspond, roughly speaking, to a parametric and a nonparametric method respectively. (I introduced the pros and cons of parametric versus nonparametric tests in my last article.) If the data conform to a Weibull distribution, directly applying Weibull regression is of course the ideal choice, giving you the most efficient estimates; if they do not, forcing Weibull regression onto them will certainly give untrustworthy results. So if you can judge that your data follow a Weibull distribution, parametric regression, i.e., Weibull regression, is best; but if you really have no confidence in judging the distribution, you can honestly fall back on Cox regression.
Cox regression can be regarded as nonparametric and can be used regardless of the data's distribution, but precisely because it can be used on any data, it has the inevitable drawback of not using any particular dataset to the fullest. An analogy: treat the data as a person's body shape and the model as clothes. Weibull regression is a tailor, cutting clothes to your measurements; they will certainly fit you, but no one else. Cox regression is like buying clothes at the mall: they fit many people, though not everyone perfectly, let us say generally well. Whether you choose the troublesome bespoke route or simply buy off the rack depends on your preference and on how well you know your own shape. If you know it well, have the clothes tailored; if you do not, go to the mall and buy something ready-made.
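To make the parametric idea concrete: if (uncensored) survival times really do follow a Weibull distribution, its shape parameter can be estimated by maximum likelihood. The sketch below, on synthetic censoring-free data with assumed true shape 1.5 and scale 2.0, finds the shape MLE by bisection on its score equation:

```python
import numpy as np

# Weibull shape parameter by maximum likelihood: the shape MLE solves a
# one-dimensional score equation, found here by bisection. Synthetic,
# censoring-free data for illustration only.
rng = np.random.default_rng(6)
true_shape, true_scale = 1.5, 2.0
t = true_scale * rng.weibull(true_shape, 5000)

logs = np.log(t)
def score(k):  # increasing in k; zero at the MLE of the shape parameter
    tk = t ** k
    return (tk * logs).sum() / tk.sum() - 1.0 / k - logs.mean()

lo, hi = 0.1, 10.0
for _ in range(60):                      # bisection on [0.1, 10]
    mid = 0.5 * (lo + hi)
    if score(mid) > 0:
        hi = mid
    else:
        lo = mid
shape_hat = 0.5 * (lo + hi)
print(round(shape_hat, 2))
```

When the Weibull assumption holds, this kind of parametric estimate is more efficient than the distribution-free Cox approach, which is exactly the tailor-versus-mall trade-off described above.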
8. Principal component regression. Principal component regression is a composite method, equivalent to combining principal component analysis with linear regression. It is mainly used when there is high correlation among the independent variables, which is not uncommon in practice. For example, suppose the independent variables you want to analyze include both blood pressure and blood sugar, and these two indicators may be correlated. Putting both into the model at the same time affects the model's stability and can sometimes have serious consequences, such as results badly inconsistent with reality. There are many solutions, the simplest being to drop one of the variables. But if you cannot bear to (after all, the data were collected painstakingly, and deleting a variable feels like a waste), you can consider principal component regression, which effectively expresses the information contained in the two variables through one new variable. This new variable is called a principal component, hence the name principal component regression. Of course, one variable replacing two cannot carry all of their information, perhaps only 80% or 90% of it, but sometimes we must choose: a model with 100% of the information but many variables, or one with 90% of the information and only one or two variables? For example, to diagnose a cold, do you run every symptom check and laboratory test related to colds, or judge from a few symptoms alone? Judging from a few symptoms that there is a 90% chance it is a cold does not use 100% of the information, but models are meant for practical use, not castles in the air, and practical use demands simplicity.

For a disease, if 30 indicators give a 100% diagnosis and 3 indicators give an 80% diagnosis, I think everyone would choose the 3-indicator model. This is the rationale for principal component regression: synthesize the information of many indicators into a few simple variables, so that a few principal components may contain most of the information of the many original independent variables. That is the principle of principal component regression.
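A minimal sketch of principal component regression on synthetic data: two highly correlated predictors (stand-ins for the blood pressure and blood sugar of the example above) are replaced by their first principal component, and the outcome is regressed on that single component:

```python
import numpy as np

# Principal component regression: replace two correlated predictors with
# their first principal component, then regress y on that component.
rng = np.random.default_rng(7)
n = 300
z = rng.normal(size=n)
x1 = z + 0.1 * rng.normal(size=n)       # two predictors sharing one signal
x2 = z + 0.1 * rng.normal(size=n)
y = 1.0 + 2.0 * z + rng.normal(0, 0.5, n)

X = np.column_stack([x1, x2])
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Xc @ Vt[0]                        # scores on the first component

var_explained = s[0] ** 2 / (s ** 2).sum()
D = np.column_stack([np.ones(n), pc1])
coef, *_ = np.linalg.lstsq(D, y, rcond=None)
print(round(var_explained, 2))          # share of predictor variance kept
```

Here the single component retains almost all the predictor information, which is the "90% of the information in one or two variables" trade-off described above.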
9. Ridge regression. I have not looked up the origin of the name; perhaps it is because the ridge trace plot looks a bit like a mountain ridge. Do not dwell on the name. Ridge regression is also used to deal with high correlation among independent variables; it simply differs from principal component regression in its estimation method. Linear regression is computed with the least squares estimator, and when the independent variables are highly correlated, the least squares parameter estimates become unstable. If something is added to the formula to make them stable, the problem is solved. That is the idea of ridge regression: add a constant k to the least squares estimator, changing the estimates and making them stable. How large should k be? That is judged from the ridge trace plot, which is presumably the origin of the name. You try many values of k, draw the ridge trace plot, see at which value the estimates become stable, and thereby determine k, solving the instability of the parameter estimates.
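The "add something to make it stable" idea has a simple closed form: the normal equations X'X b = X'y gain a k·I term. The sketch below, on synthetic nearly collinear data, contrasts the unstable least squares coefficients with the stabilized ridge coefficients:

```python
import numpy as np

# Ridge regression: add k * I to the least squares normal equations, which
# stabilizes the estimates when predictors are highly correlated.
rng = np.random.default_rng(8)
n = 100
z = rng.normal(size=n)
X = np.column_stack([z + 0.01 * rng.normal(size=n),
                     z + 0.01 * rng.normal(size=n)])   # nearly collinear
y = X @ np.array([1.0, 1.0]) + rng.normal(0, 0.5, n)   # true coefs (1, 1)

def ridge(X, y, k):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)

b_ols = ridge(X, y, 0.0)      # k = 0: plain least squares, unstable
b_ridge = ridge(X, y, 1.0)    # k = 1: shrunken, stable coefficients
print(np.round(b_ols, 1), np.round(b_ridge, 2))
```

The individual least squares coefficients can wander far from (1, 1) even though their sum is well determined, while the ridge coefficients stay near the true values; scanning many k values and watching where the coefficients settle is exactly the ridge trace plot.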
10. Partial least squares regression. Partial least squares regression can also be used to solve the problem of high correlation among independent variables. But it has one advantage over principal component regression and ridge regression: it can be used when there are few cases, even fewer cases than independent variables. That sounds incredible. Are we not usually told that the number of cases should be at least ten times the number of independent variables? How can there be fewer cases than independent variables, and how would that even be computed? Yet partial least squares regression really does have this outrageous advantage. So if your independent variables are highly correlated, your cases are especially few, and your independent variables are especially many (a hopeless pile of problems), you need not panic: just use partial least squares regression. Its principle resembles that of principal component regression: it too extracts part of the information of the independent variables, sacrificing some accuracy but ensuring the model better fits reality. However, this method does not analyze the dependent and independent variables directly; instead it builds new composite variables reflecting part of the information of both the dependent and independent variables, so it does not require more cases than independent variables. Partial least squares regression has another great advantage: it can handle multiple dependent variables. Ordinary linear regression has only one dependent variable, while partial least squares regression can analyze multiple dependent variables together with multiple independent variables. Because its principle is to extract information from the dependent and independent variables simultaneously to form new variables for reanalysis, having several dependent variables poses no problem for it.
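A hedged sketch of partial least squares with a single component: the weight vector is chosen to maximize covariance between the predictor scores and the outcome, then y is regressed on that score. The data are synthetic, with deliberately more predictors (20) than cases (10) to illustrate the point above:

```python
import numpy as np

# One-component partial least squares: works even with p > n, because only
# a single composite score is regressed on, not all 20 predictors.
rng = np.random.default_rng(9)
n, p = 10, 20
X = rng.normal(size=(n, p))
true_w = np.zeros(p); true_w[:3] = 1.0       # only 3 predictors matter
y = X @ true_w + rng.normal(0, 0.1, n)

Xc = X - X.mean(axis=0)
yc = y - y.mean()
w = Xc.T @ yc                         # direction of max covariance with y
w /= np.linalg.norm(w)
t = Xc @ w                            # one PLS score per case
b = (t @ yc) / (t @ t)                # regress y on the score

r = np.corrcoef(t, yc)[0, 1]          # how well one component captures y
print(round(r, 2))
```

Full PLS implementations extract several such components in turn (and handle multiple dependent variables), but this single step already shows why having fewer cases than predictors is not fatal.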
I hope the explanations above help you understand the applications of regression analysis.
The above is this editor's understanding and simple application of regression analysis. For more information, you can follow Global Ivy, where more practical content is shared.