One of the Core Algorithms of Data Mining —— Regression

Regression is a broad concept; at its core it means using one set of variables to predict another. In plain language, you predict one thing from how strongly other things correlate with it. The simplest case is simple linear regression, with a single predictor. For example, my wife bought a bag this afternoon, so of course there was no dinner for me. A step up in complexity is several predictors (multiple linear regression): if my mother-in-law comes to visit, my wife will very likely cook; if my father-in-law comes as well, she will certainly cook. Why can I make these judgments? Because these situations have happened many times before, so I can predict from them whether my wife will cook dinner. One thing deserves attention here, because I made this mistake myself: it is not true that the more predictor variables, the better. When modeling I used to want to throw dozens of indicators at the prediction, but on the one hand every additional variable brings its own error into the model, which quietly inflates the overall error, especially when the independent variables are poorly chosen; on the other hand, when two independent variables are highly correlated rather than independent, the estimated coefficients become unstable (the multicollinearity problem).
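To make the simple-versus-multiple distinction concrete, here is a minimal sketch, assuming NumPy and scikit-learn and using invented toy data (the cooking story is never turned into real numbers in the original):

```python
# A minimal sketch of simple vs. multiple linear regression with scikit-learn.
# All data below is synthetic and purely illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Simple linear regression: one predictor, one response.
x1 = rng.uniform(0, 10, size=100).reshape(-1, 1)
y = 2.0 * x1.ravel() + 1.0 + rng.normal(0, 1, size=100)
simple = LinearRegression().fit(x1, y)
print("simple: slope=%.2f intercept=%.2f" % (simple.coef_[0], simple.intercept_))

# Multiple linear regression: several predictors. Note that x3 is almost a copy
# of x2 -- highly correlated, not independent -- which is exactly the
# multicollinearity trap warned about above.
x2 = rng.uniform(0, 10, size=100)
x3 = x2 + rng.normal(0, 0.01, size=100)
X = np.column_stack([x1.ravel(), x2, x3])
multi = LinearRegression().fit(X, y)
print("multiple: coefficients =", np.round(multi.coef_, 2))
# The x2/x3 coefficients can come out large and of opposite sign even though
# neither truly drives y: extra, correlated predictors inflate the error.
```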

Of course, the problems of the big-data era are not visible to the naked eye (otherwise all that massive computation would be pointless), so besides the two kinds of regression above we also routinely use: polynomial regression, where the relationship is an n-th order polynomial; logistic regression (decision trees are a related approach), where the outcome is a categorical variable; Poisson regression, where the outcome variable represents a count or frequency; and then nonlinear regression, time-series regression, autoregression, and many more. Here I will stick to a few commonly used models that are easy to explain.

All models share one requirement: they must be explainable, in the parameter choices, the variable choices, and the results, because once the model is built it is the business people who will use it and the boss who will see the results. If all you can say is "this is the result" and you cannot answer the follow-up questions, a promotion or a raise is basically hopeless. For example, if you find that sunshine hours are proportional to grape sales in some region, you probably have to explain why. Digging further, sunshine hours relate to the sugar content of the grapes and to the yield: long sunshine means sweeter grapes and a bigger harvest, so naturally a lower price, and cheap, tasty grapes will certainly sell well. Another example: if coffee sales in an oil-producing region rise, international oil prices fall. The two are related, but you cannot just tell the leadership that they are correlated; you have to find out why. Coffee is the main drink keeping workers alert; rising coffee sales turn out to mean rising work intensity, oil shipments and exports increase, and there is the link between falling oil prices and coffee sales (a made-up example, don't overthink it; the idea is similar to inferring shipments from remote-sensing data on ships).
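As a concrete illustration of two entries from that list, here is a hedged sketch (my own choice of NumPy and scikit-learn, on synthetic data) of a polynomial fit and a logistic regression:

```python
# Polynomial regression (numpy.polyfit) and logistic regression (scikit-learn)
# on synthetic data; only meant to show what these model families look like.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Polynomial regression: y follows a 2nd-order polynomial in x.
x = np.linspace(-3, 3, 200)
y = 0.5 * x**2 - x + rng.normal(0, 0.3, size=x.size)
print("polynomial coefficients:", np.round(np.polyfit(x, y, deg=2), 2))

# Logistic regression: the outcome is a categorical (0/1) variable, e.g.
# "does this customer buy grapes this week?" (an invented label).
X = rng.normal(0, 1, size=(200, 2))
labels = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.5, size=200) > 0).astype(int)
clf = LogisticRegression().fit(X, labels)
print("predicted probability for a new sample:",
      round(clf.predict_proba([[0.2, -0.1]])[0, 1], 2))
```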

The sharp weapon of regression is the method of least squares, credited to the great mathematician Gauss (the French mathematician Legendre claimed to have invented it first; the priority dispute was never settled, but the method is firmly tied to Gauss's name). It finds, from the sample data, the relationship between the predictors and the response that minimizes the sum of squared errors between the predictions and the true values. It is similar to the cooking example above, except that my example only spoke of a "high probability"; how high, exactly? The least squares fit is what pins that relationship down numerically. I will not go into the formulas here: just use the tools, since basically every data analysis tool provides this method. What I mainly want to do is clear up a misconception I once held. Least squares will produce a solution in virtually any case, because all it does is minimize the sum of squared errors; even a large error, as long as it is the smallest achievable sum, is accepted as the answer. By now you can see where I am going: even if the independent variable has nothing to do with the dependent variable, the method still spits out a result (the first sketch after the list below demonstrates this on pure noise). So the important thing to know is what least squares requires of the data set:

1. Normality: for a fixed value of the independent variable, the dependent variable is normally distributed, which roughly means that most observations cluster around the same answer. To build a regression model we feed in a large number of Y~X sample pairs; if the values of Y are all over the place, there is nothing to regress.

2. Independence: the Y of each sample is independent of the others. This is easy to understand: the answers cannot be tied to one another, just like flipping a coin, where knowing that the first toss came up tails tells you nothing about the second.

3. Linearity: x and y are actually related. Everything in the world is connected in some way, and a butterfly and a tornado (or a tsunami) are related, but the link can be direct or indirect. What is required here is a direct correlation between the independent variables and the dependent variable.

4. Homoscedasticity: the variance of the dependent variable does not change with the level of the independent variable. Variance, which I covered in the post on descriptive statistics, measures the variability of a data set, so the requirement here is that the variability of the results stays constant; ideally, the results corresponding to each value of the independent variable fall within as narrow a range as possible. (I cannot think of a neat example off the top of my head, so the residual check sketched right after this list has to stand in for a picture.)
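Two quick sketches tie the earlier caveat about least squares and these four assumptions together. Both use toy data and libraries of my own choosing (NumPy, statsmodels, SciPy), so treat them as illustrations rather than a prescription: the first fits a line through pure noise to show that least squares never refuses, the second runs a basic residual check for normality and homoscedasticity.

```python
# Sketch 1: least squares never refuses. It returns the line with the smallest
# sum of squared errors even when x and y are completely unrelated noise.
import numpy as np
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(42)
x_noise = rng.normal(size=200)
y_noise = rng.normal(size=200)              # no relationship to x_noise at all

coeffs, *_ = np.linalg.lstsq(np.column_stack([x_noise, np.ones_like(x_noise)]),
                             y_noise, rcond=None)
print("fit through pure noise: slope=%.3f intercept=%.3f" % tuple(coeffs))

# Sketch 2: checking the assumptions on the residuals of a realistic fit.
x = rng.uniform(0, 10, size=300)
y = 3 * x + 5 + rng.normal(0, 1, size=300)  # data that does satisfy the assumptions

X = sm.add_constant(x)                      # add the intercept column
model = sm.OLS(y, X).fit()
resid = model.resid

# Normality of residuals: a high p-value is consistent with normal errors.
_, shapiro_p = stats.shapiro(resid)
print("Shapiro-Wilk p-value:", round(shapiro_p, 3))

# Homoscedasticity: Breusch-Pagan tests whether the residual variance
# changes with the level of the predictors.
_, bp_p, _, _ = het_breuschpagan(resid, X)
print("Breusch-Pagan p-value:", round(bp_p, 3))
```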

We use regression to model while trying to eliminate the influence of the points above. Let's walk through the simple regression process in detail (the other flavours are much the same; once this is clear, the rest follow):

First, find indicators: identify the indicators related to the variable you want to predict. (Strictly speaking, step zero is working out what you want to predict in the first place; that is a big topic in its own right, involving your business goals, the boss's goals, and the key business metrics behind them, so here we stick to the method itself.) The textbook practice is that business experts propose candidate indicators and we test which of them are strongly correlated with the target. But most of the business people I have worked with are unreliable at the start of a modeling project (genuinely unreliable: no ideas, no opinions), so my approach is to gather everything plausibly related to the business goal (sometimes hundreds of indicators), run a correlation analysis, filter further with a principal component analysis, and only then show the shortlist to the business experts. At that point they suddenly have ideas (something has to prime them first) and contribute a few more. The predictor variables matter most and are directly tied to your results and deliverables, so expect this to be a multi-round optimization process.
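A rough sketch of that screening workflow, assuming pandas and scikit-learn, with entirely invented indicator names and thresholds:

```python
# Screen a wide table of candidate indicators: keep the ones most correlated
# with the target, then run PCA to gauge how much redundancy is left before
# showing the shortlist to the business experts.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
df = pd.DataFrame(rng.normal(size=(500, 20)),
                  columns=[f"indicator_{i}" for i in range(20)])
df["target"] = 2 * df["indicator_0"] - df["indicator_3"] + rng.normal(0, 0.5, size=500)

# 1. Correlation screen: keep indicators whose absolute correlation with the
#    target exceeds a (somewhat arbitrary) threshold.
corr = df.corr()["target"].drop("target").abs().sort_values(ascending=False)
keep = corr[corr > 0.1].index.tolist()
print("indicators surviving the correlation screen:", keep)

# 2. PCA on the survivors: how much of their variance is shared?
pca = PCA().fit(df[keep])
print("variance explained per component:", np.round(pca.explained_variance_ratio_, 2))
```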

Second, find the data. Not much to say here: organize it either along a time axis (in my view the better way, since most phenomena follow regular patterns) or as a cross-section, keeping in mind that different points in a cross-section can fluctuate a great deal, so be careful. At the same time, basic data preparation should include handling extreme values and null values.
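A minimal pandas sketch of that basic cleaning: percentile clipping for extreme values and a median fill for nulls. The thresholds and the fill strategy are illustrative assumptions, not a recommendation.

```python
# Basic cleaning before modeling: winsorise outliers and fill missing values.
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
s = pd.Series(rng.normal(100, 10, size=1000))
s.iloc[::97] = np.nan          # sprinkle in some missing values
s.iloc[10] = 10_000            # and one obvious outlier

# Extreme-value handling: clip to the 1st/99th percentiles.
lower, upper = s.quantile([0.01, 0.99])
s_clipped = s.clip(lower=lower, upper=upper)

# Null handling: fill with the median here; interpolation or dropping rows
# are equally common choices depending on the data.
s_clean = s_clipped.fillna(s_clipped.median())
print(s_clean.describe())
```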

Third, build the regression model. This is the easiest step: every mining tool provides the various regression methods, and your task is simply to hand the computer what you have prepared.
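For instance, with statsmodels (one tool among many; the column names below are made up), the whole fitting step is a single call, and it also produces the hypothesis-test output used in the next step:

```python
# Fit an ordinary least-squares model and print its diagnostic summary.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(11)
X = pd.DataFrame({"sunshine_hours": rng.uniform(4, 12, 200),
                  "rainfall_mm": rng.uniform(0, 50, 200)})
y = 5 * X["sunshine_hours"] - 0.2 * X["rainfall_mm"] + rng.normal(0, 2, 200)

model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.summary())   # coefficients, t-statistics, p-values, R-squared
```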

Fourth, test and revise. The fitted model comes with all kinds of hypothesis-test statistics for its coefficients, so you can see at a glance how good the model is and revise and optimize it accordingly. Two measures matter in particular: precision, the proportion of predicted positives that are truly correct, and recall, the proportion of all truly correct cases that the model manages to predict. Generally speaking, precision and recall trade off against each other, so we need to find a balance point.
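A small sketch of that trade-off, using a toy logistic regression and scikit-learn's precision and recall metrics; the thresholds are arbitrary:

```python
# Moving the decision threshold up tends to raise precision and lower recall,
# and vice versa, so the "balance point" is a choice, not a given.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(13)
X = rng.normal(size=(1000, 3))
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 1, size=1000) > 0).astype(int)

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)[:, 1]

for threshold in (0.3, 0.5, 0.7):
    pred = (proba >= threshold).astype(int)
    print("threshold %.1f  precision %.2f  recall %.2f"
          % (threshold, precision_score(y, pred), recall_score(y, pred)))
```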

Fifth, explain and use: this is the moment to witness the miracle (there is usually a long wait before the witnessing). It is time to explain to your boss or to the customers why these variables were chosen, why this particular balance point was picked (for business reasons or otherwise), why something that took so long still performs so poorly (which is embarrassing), and so on.

That is enough about regression for now. Principal component analysis and correlation analysis will come in the next round, followed by another sharp weapon of data mining: clustering.