The truth is that there are countless forms of regression that can be used. Each form has its own importance and specific scenarios where it is best applied. In this article, I will explain the seven most commonly used forms of regression in data science in simple terms. Through this article, I also hope that people get a sense of the breadth of regression, rather than applying just linear or logistic regression to every problem they encounter, and that they put these many regression techniques to use!
If you are new to data science and looking for a place to start learning, the "Data Science" course is a good starting point! It covers the core topics of Python, statistics and predictive modeling, and is a great way to take your first step into data science.
What is regression analysis?
Regression analysis is a predictive modeling technique that studies the relationship between a dependent variable (target) and independent variables (predictors). This technique is used for forecasting, time series modeling and finding causal relationships between variables. For example, the relationship between reckless driving and the number of road traffic accidents a driver has is best studied through regression.
Regression analysis is an important tool for modeling and analyzing data. Here, we fit a curve or straight line to the data points so as to minimize the distance between the data points and the curve or line. I will explain this in more detail in the sections below.
Why do we use regression analysis?
As mentioned above, regression analysis estimates the relationship between two or more variables. Let's understand this through a simple example:
Suppose you want to estimate a company's sales growth based on the current economic conditions. You have recent company data which indicates that sales growth is roughly 2.5 times the economic growth. With this insight, we can predict the company's future sales using current and past information.
There are multiple benefits of using regression analysis. They are as follows:
It indicates the significant relationships between the dependent variable and the independent variables. It indicates the strength of the impact of multiple independent variables on a dependent variable.
Regression analysis also allows us to compare the effects of variables measured on different scales, such as the effect of price changes and the number of promotional activities. These benefits help market researchers, data analysts and data scientists eliminate variables and evaluate the best set of variables for building predictive models.
How many regression techniques do we have?
There are various regression techniques available for making predictions. These techniques are mostly driven by three metrics: the number of independent variables, the type of the dependent variable, and the shape of the regression line. We will discuss them in detail in the sections below.
For the creative ones: if you feel the need to use a combination of the parameters above, you can even cook up a new regression that people have never used before. But before we begin, let's look at the most commonly used regressions:
1. Linear regression
This is one of the most widely known modeling techniques. Linear regression is usually one of the first methods people pick when learning predictive modeling. In this method, the dependent variable is continuous, the independent variables can be continuous or discrete, and the nature of the regression line is linear.
Linear regression uses the best-fit straight line (also called regression line) to establish the relationship between the dependent variable (y) and one or more independent variables (x).
It is expressed by the equation Y = a + b*X + e, where a is the intercept, b is the slope of the line, and e is the error term. This equation can be used to predict the value of the target variable given the predictor variable(s).
The difference between simple linear regression and multiple linear regression is that multiple linear regression has more than one (>1) independent variable, while simple linear regression has only one independent variable. The question now is: how do we obtain the best-fit line?
How to get the best fitting line (values of a and b)?
This task can be easily accomplished with the least squares method. It is the most common method used for fitting a regression line. It calculates the best-fit line for the observed data by minimizing the sum of the squares of the vertical deviations from each data point to the line. Because the deviations are squared first, positive and negative values do not cancel out when they are summed.
We can use the metric R-squared to evaluate model performance.
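To make this concrete, here is a minimal sketch of fitting a simple linear regression and checking R-squared with scikit-learn; the economic-growth/sales-growth numbers are made up purely for illustration.

```python
# A minimal sketch of simple linear regression (ordinary least squares)
# using scikit-learn. The data points are invented for illustration:
# x = economic growth (%), y = company sales growth (%).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

x = np.array([[1.0], [1.5], [2.0], [2.5], [3.0], [3.5]])
y = np.array([2.4, 3.9, 5.1, 6.2, 7.6, 8.8])

model = LinearRegression()      # finds a (intercept) and b (slope) by least squares
model.fit(x, y)

print("intercept a:", model.intercept_)
print("slope b:", model.coef_[0])
print("R-squared:", r2_score(y, model.predict(x)))
```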
Important points: There must be a linear relationship between the independent and dependent variables. Multiple regression suffers from multicollinearity, autocorrelation and heteroscedasticity. Linear regression is very sensitive to outliers, which can strongly affect the regression line and ultimately the predicted values. Multicollinearity increases the variance of the coefficient estimates, making the estimates very sensitive to small changes in the model; as a result, the coefficient estimates become unstable. In the case of multiple independent variables, we can use forward selection, backward elimination or stepwise selection to pick the most significant independent variables.
2. Logistic regression

Logistic regression is used to find the probability of success and the probability of failure. We should use logistic regression when the dependent variable is binary in nature (0/1, true/false, yes/no). Here, the value of Y ranges from 0 to 1, and the relationship can be expressed by the following equations.
odds = p / (1 - p) = probability of the event occurring / probability of the event not occurring

ln(odds) = ln(p / (1 - p))

logit(p) = ln(p / (1 - p)) = b0 + b1*X1 + b2*X2 + ... + bk*Xk
Above, p is the probability of the characteristic of interest. A question you should ask at this point is: "Why did we use a log in the equation?"
Since we are working with a binomial distribution (for the dependent variable), we need to choose the link function best suited to this distribution, and that is the logit function. In the equation above, the parameters are chosen to maximize the likelihood of observing the sample values rather than to minimize the sum of squared errors (as in ordinary regression).
Important points: Logistic regression is widely used for classification problems. It does not require a linear relationship between the dependent and independent variables; it can handle various types of relationships because it applies a nonlinear log transformation when forming predictions. To avoid over-fitting and under-fitting, we should include all the significant variables; a good way to ensure this is to estimate the logistic regression with a stepwise method. It requires a large sample size, because maximum likelihood estimation is less efficient than ordinary least squares when the sample size is small. The independent variables should not be correlated with one another, i.e. there should be no multicollinearity. However, we can choose to include interactions of categorical variables in the analysis and the model. If the dependent variable is ordinal, it is called ordinal logistic regression; if the dependent variable is multi-class, it is called multinomial logistic regression.
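As an illustration, here is a minimal sketch of a binary logistic regression in scikit-learn; the single feature and the 0/1 labels are synthetic and chosen only to show the idea.

```python
# A minimal sketch of logistic regression for a binary (0/1) target using
# scikit-learn; the feature values and labels below are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression()      # parameters are fit by maximum likelihood
clf.fit(X, y)

# predict_proba returns p, where logit(p) = b0 + b1*x
print("P(y=1 | x=2.2):", clf.predict_proba([[2.2]])[0, 1])
print("b0, b1:", clf.intercept_[0], clf.coef_[0, 0])
```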
3. Polynomial regression

A regression equation is a polynomial regression equation if the power of the independent variable is greater than 1. The equation below represents a polynomial equation:
Y = A + B * X ^ 2
In this regression technique, the best-fit line is not a straight line. Rather, it is a curve that fits the data points.
Important points: While it may be tempting to fit a higher-degree polynomial to get a lower error, this can result in over-fitting. Always plot the fit to see whether it matches the data, and focus on making sure the curve fits the nature of the problem. In particular, pay attention to the curve towards the ends and check whether those shapes and trends make sense; higher-degree polynomials can end up producing weird results on extrapolation.
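Below is a minimal sketch of polynomial regression built by expanding the feature into polynomial terms and fitting a linear model on top; the data and the choice of degree=2 are illustrative assumptions.

```python
# A minimal sketch of polynomial regression: expand x into [1, x, x^2] and
# fit ordinary least squares on the expanded features. Data is synthetic.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
x = np.linspace(0, 4, 20).reshape(-1, 1)
y = 1.0 + 0.5 * x.ravel() ** 2 + rng.normal(0, 0.2, 20)   # curved relationship

# degree=2 keeps the curve simple; higher degrees risk over-fitting
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)
print("prediction at x=3.5:", model.predict([[3.5]])[0])
```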
4. Stepwise regression

This form of regression is used when we are dealing with multiple independent variables. In this technique, the selection of independent variables is done with the help of an automatic process, without human intervention.
This feat is achieved by observing statistical values such as R-squared, t-statistics and the AIC metric. Stepwise regression basically fits the regression model by adding or dropping one covariate at a time based on a specified criterion. Some of the most commonly used stepwise regression methods are listed below:
Standard stepwise regression does two things: it adds and removes predictors as needed at each step. Forward selection starts with the most significant predictor in the model and adds a variable at each step. Backward elimination starts with all predictors in the model and removes the least significant variable at each step.
The aim of this modeling technique is to maximize predictive power with the minimum number of predictor variables. It is one of the methods for handling high-dimensional data sets.
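scikit-learn does not ship a classical p-value/AIC-driven stepwise routine; as a rough stand-in, the sketch below uses SequentialFeatureSelector, which performs forward selection (or backward elimination) based on cross-validated scores. The data set is synthetic.

```python
# Forward selection / backward elimination via SequentialFeatureSelector.
# Note: this selects variables by cross-validated score, not by t-tests or AIC
# as in classical stepwise regression. Data is synthetic.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=3,
    direction="forward",        # use "backward" for backward elimination
)
selector.fit(X, y)
print("selected feature indices:", np.flatnonzero(selector.get_support()))
```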
5. Ridge regression
Ridge regression is a technique used when the data suffers from multicollinearity (the independent variables are highly correlated). Under multicollinearity, even though the ordinary least squares (OLS) estimates are unbiased, their variances are large, which makes the estimates deviate far from the true values. By adding a degree of bias to the regression estimates, ridge regression reduces the standard errors.
Above, we see the equation of linear regression. Remember? It can be expressed as:
y = a + b * x
This equation also has an error term. The complete equation becomes:
y = a + b*x + e (error term)

[The error term is the value needed to correct for the prediction error between the observed and predicted values.]

With multiple independent variables, this becomes:

y = a + b1*x1 + b2*x2 + ... + e
For a linear equation, the prediction error can be decomposed into two sub-components: the first due to bias and the second due to variance. Prediction error can occur because of either of these components, or both. Here, we will discuss the error caused by variance.
Ridge regression solves the multicollinearity problem through the shrinkage parameter λ (lambda). Look at the equation below:

minimize: Σ(y − ŷ)² + λ Σβ²
In this equation, we have two components. The first is the least squares term, and the second is λ times the sum of the squares of the coefficients β. The second term is added to the least squares term in order to shrink the parameters so that they have very low variance.
Important points: The assumptions of this regression are the same as those of least squares regression, except that normality is not assumed. Ridge regression shrinks the values of the coefficients but never drives them exactly to zero, which means it has no feature selection capability. It is a regularization method and uses L2 regularization.
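A minimal sketch of ridge regression with scikit-learn follows; the alpha parameter plays the role of the shrinkage parameter λ described above, and the data set is synthetic.

```python
# A minimal sketch of ridge regression (L2 regularization); alpha ~ lambda.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

ridge = Ridge(alpha=1.0)        # larger alpha = stronger shrinkage
ridge.fit(X, y)
print("coefficients:", ridge.coef_)   # shrunk towards zero, but not exactly zero
```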
6. Lasso regression

Similar to ridge regression, lasso (Least Absolute Shrinkage and Selection Operator) also penalizes the absolute size of the regression coefficients. In addition, it can reduce the variability of a linear regression model and improve its accuracy. Look at the equation below:

minimize: Σ(y − ŷ)² + λ Σ|β|
Lasso regression differs from ridge regression in that it uses absolute values rather than squares in the penalty function. This means it penalizes (or equivalently constrains) the sum of the absolute values of the estimates, which causes some of the parameter estimates to be exactly zero. The larger the penalty applied, the further the estimates are shrunk towards absolute zero. This results in variable selection out of the given n variables.
Important points: The assumptions of this regression are the same as those of least squares regression, except that normality is not assumed. It can shrink coefficients to exactly zero, which certainly helps with feature selection. It is a regularization method and uses L1 regularization. If a group of predictors is highly correlated, lasso picks only one of them and shrinks the others to zero.
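Here is a minimal sketch of lasso regression on a synthetic data set; note how coefficients of uninformative features are typically driven to exactly zero, which is the built-in feature selection described above.

```python
# A minimal sketch of lasso regression (L1 regularization); alpha ~ lambda.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0)
lasso.fit(X, y)
print("coefficients:", lasso.coef_)   # uninformative features typically end up exactly 0.0
```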
7. ElasticNet regression

ElasticNet regression is a hybrid of the lasso and ridge regression techniques. It is trained with both L1 and L2 priors as regularizers. ElasticNet is useful when there are multiple features that are correlated: lasso is likely to pick one of them at random, whereas ElasticNet is likely to pick both.
A practical advantage of trading off between lasso and ridge is that it allows ElasticNet to inherit some of ridge regression's stability under rotation.
Important points: It encourages a group effect when variables are highly correlated. There is no limit on the number of variables it can select. It can suffer from double shrinkage.
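The sketch below shows ElasticNet in scikit-learn; l1_ratio controls the mix between the L1 and L2 penalties (1.0 is pure lasso, 0.0 is pure ridge), and the data set is synthetic.

```python
# A minimal sketch of ElasticNet, which combines L1 and L2 penalties.
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=100, n_features=10, n_informative=4,
                       noise=5.0, random_state=0)

enet = ElasticNet(alpha=1.0, l1_ratio=0.5)   # 0.5 = equal weight to L1 and L2
enet.fit(X, y)
print("coefficients:", enet.coef_)
```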
How to choose the correct regression model?

Life is usually simple when you know only one or two techniques. One training institute I know of tells its students: if the outcome is continuous, use linear regression; if it is binary, use logistic regression! However, the more options we have available, the harder it becomes to choose the right one. A similar situation arises with regression models.
With so many types of regression models, it is important to choose the technique best suited to the types of the independent and dependent variables, the dimensionality of the data and other essential characteristics of the data. Below are the key factors for selecting the correct regression model:
Exploring the data is an inevitable part of building a predictive model. Before selecting a suitable model, you should first identify the correlations between variables and their influence.

To compare the goodness of fit of different models, we can analyze different metrics, such as the statistical significance of the parameters, R-squared, adjusted R-squared, the AIC, the BIC and the error term. Another is Mallow's Cp criterion, which essentially checks for possible bias in your model by comparing the model with all possible (carefully chosen) sub-models.

Cross-validation is the best way to evaluate models used for prediction. Here, the data set is divided into two groups (training and validation). A simple mean squared difference between the observed and predicted values gives a measure of prediction accuracy.

If your data set has multiple confounding variables, you should not choose an automatic model selection method, because you do not want to put them all into the model at the same time.

It also depends on your objective. A less powerful model can be easier to implement than one with high statistical significance.

Regression regularization methods (lasso, ridge and ElasticNet regression) work well when the variables in the data set are high-dimensional and multicollinear.
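As a small illustration of the cross-validation point above, the sketch below compares a few candidate models by mean cross-validated squared error; the models and the synthetic data are chosen only for demonstration.

```python
# Comparing candidate regression models by 5-fold cross-validated MSE.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=150, n_features=8, noise=10.0, random_state=0)

for name, model in [("linear", LinearRegression()),
                    ("ridge", Ridge(alpha=1.0)),
                    ("lasso", Lasso(alpha=1.0))]:
    # scikit-learn reports negative MSE, so flip the sign for readability
    mse = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(f"{name}: mean CV MSE = {mse:.2f}")
```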
Concluding remarks

By now, I hope you have gained an overview of regression. These regression techniques should be applied with the conditions of the data in mind. One of the best tricks for figuring out which technique to use is to check the family of the variables, i.e. whether they are discrete or continuous.
In this article, I discussed seven types of regression and some key facts associated with each technique. As someone new to this field, I suggest you learn these techniques and implement them in your own models.
The above are the seven regression models recommended by the author. If these seven models interest you, try them out yourself. Knowing the theory is not enough; only by experimenting more can we truly master these models.