In ensemble learning, the algorithm does not require every individual learner to achieve the best performance; rather, it expects them to hold different views of the problem: good but different.
Described in terms of a classification problem, this means the learners' classification abilities differ: some samples are classified correctly by one learner, others by another. No single learner is required to classify all samples correctly.
In technical terms, different learners embody different preference models, but each on its own is only a weak model. Ensemble learning combines several weak models to obtain a good strong model; the idea is that the different learners correct one another's mistakes, with the ultimate goal of improving accuracy.
Ensemble learning achieves learning by combining multiple learners. A finite set of models is combined, and the approach goes under various names: multi-classifier system, committee learning, modular system, classifier fusion, combination, aggregation, and so on. These concepts are related but differ in detail, and the community has not yet reached a consensus on their definitions. Ensembles perform very strongly and are a first choice in many high-level competitions (KDD Cup, Kaggle).
In machine learning, a hypothesis that fits the training set may not perform as well in practice, so a learning algorithm takes a certain risk when choosing which hypothesis to output; combining multiple hypotheses reduces this risk (intuitively, the errors between the individual hypotheses and the target hypothesis partially cancel out).
In Zhou Zhihua's Watermelon Book, the Hoeffding inequality is used to show that, assuming the base classifiers' errors are independent, the error rate of the ensemble decreases exponentially as the number of individual classifiers $T$ grows and eventually tends to zero:

$$P\big(H(x) \ne f(x)\big) = \sum_{k=0}^{\lfloor T/2 \rfloor} \binom{T}{k} (1-\epsilon)^{k} \epsilon^{T-k} \le \exp\left(-\tfrac{1}{2} T (1-2\epsilon)^{2}\right)$$
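As a rough numerical check of this claim, the sketch below (my own illustration, assuming T independent base classifiers that each err with probability eps = 0.4) computes the exact majority-vote error and the Hoeffding bound exp(-T * (1 - 2*eps)**2 / 2):

```python
from math import comb, exp

def majority_vote_error(T, eps):
    """Exact error of a majority vote over T independent classifiers,
    each with individual error rate eps (T assumed odd)."""
    # The ensemble errs when at most floor(T/2) classifiers are correct.
    return sum(comb(T, k) * (1 - eps) ** k * eps ** (T - k)
               for k in range(T // 2 + 1))

def hoeffding_bound(T, eps):
    """Hoeffding upper bound on the majority-vote error."""
    return exp(-T * (1 - 2 * eps) ** 2 / 2)

eps = 0.4
for T in (1, 11, 101):
    print(T, majority_vote_error(T, eps), hoeffding_bound(T, eps))
```

Even with fairly poor base classifiers (40% error), the exact ensemble error shrinks quickly as T grows, always staying under the exponential bound.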
Ensemble learning first produces a group of "individual learners" and then combines them with some strategy. Depending on whether the individual learners all use the same learning algorithm, ensembles can be divided into homogeneous and heterogeneous.
For the ensemble to perform better than its individual learners, those learners must meet two requirements, "good but different": each must be reasonably accurate, and they must be diverse.
The first condition is relatively easy to meet: just train a model on the problem at hand that beats random guessing. The second condition is the core issue of ensemble learning research. Every individual learner learns the same problem, so the learners cannot be completely independent of one another. Think of a teacher asking the class for different opinions, or of the search for novelty when writing a paper: this is hard even for people, let alone for a simple learning algorithm.
To enhance the diversity of the individual learners while keeping each one good enough, we can reason intuitively: the whole learning process runs from data, to model, to output, and perturbations can be introduced at each stage.
Consider the input first. If each learner learns from different samples, relatively different individual learners are obtained. The question is then how to split the training samples: they can be chosen at random, or different individual learners can be trained on different subsets of the attributes.
Second, consider the model: if the base learners use different models, different individual learners can be trained.
Finally, consider the output: different individual learners can also be obtained by partitioning the labels in different ways.
Following these three ideas, there are five main methods:
Different sample subsets are generated from the original training samples, and different individual learners are trained on them, such as the bootstrap sampling used in Bagging and the sequential sampling (reweighting) used in Boosting.
This training-sample perturbation is simple and efficient, but it is only effective for unstable base learners such as decision trees and neural networks. For stable base learners such as linear models, support vector machines, naive Bayes, and k-NN, the effect is not obvious, because stable learners are not very sensitive to small changes in the training data.
Speaking of Bagging and Boosting, these are the two classic families: ensemble learning divides into sequential methods (Boosting), where the learners depend strongly on one another and must be generated serially, and parallel methods (Bagging), where there is no strong dependence and the learners can be generated simultaneously.
The concrete procedure is: first give every training sample the same weight, then train the first base classifier and test it on the training set; increase the weights of the misclassified samples (in the actual algorithm, equivalently decrease the weights of the correctly classified samples); train the second base classifier on the reweighted training set; and repeat this process until a good enough learner is obtained.
The most famous Boosting algorithm is AdaBoost, proposed by Yoav Freund and Robert Schapire in 1997. The figure below shows Bing Academic's search results for AdaBoost papers:
Following the derivation in Zhou Zhihua's Watermelon Book, we analyse AdaBoost as an "additive model":
The linear combination of base learners is written as:

$$H(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$$
Define the loss function of the whole learner as the exponential loss, which we want to minimise:

$$\ell_{\exp}(H \mid \mathcal{D}) = \mathbb{E}_{x \sim \mathcal{D}}\left[e^{-f(x)H(x)}\right]$$
Here $f$ is the true label function, taking values in $\{-1, +1\}$, and $\mathcal{D}$ is a real-valued function representing the weight distribution over the samples (misclassified samples carry higher weight, correctly classified ones lower, and all the weights together form a distribution).
If the linear combination $H(x)$ is to minimise the exponential loss, the usual approach is to set the partial derivative to zero and solve. Since $f(x)$ takes only two values, the derivative is:

$$\frac{\partial\, \ell_{\exp}(H \mid \mathcal{D})}{\partial H(x)} = -e^{-H(x)}\, P\big(f(x)=1 \mid x\big) + e^{H(x)}\, P\big(f(x)=-1 \mid x\big)$$
Setting the partial derivative to 0 and solving gives:

$$H(x) = \frac{1}{2} \ln \frac{P\big(f(x)=1 \mid x\big)}{P\big(f(x)=-1 \mid x\big)}$$
Hence:

$$\operatorname{sign}\big(H(x)\big) = \underset{y \in \{-1,1\}}{\arg\max}\; P\big(f(x)=y \mid x\big)$$
This means that minimising the exponential loss also minimises the classification error rate: $\operatorname{sign}(H(x))$ attains the Bayes-optimal decision. The exponential loss is a surrogate for the original task; because it is continuously differentiable, it replaces the 0/1 loss as the optimisation objective. Everything above justifies carrying out the further derivation with this continuous exponential loss.
In the AdaBoost algorithm, the first base classifier $h_1$ is obtained by applying the base learning algorithm directly to the initial data distribution. The subsequent $h_t$ and $\alpha_t$ are generated iteratively. When the base classifier $h_t$ has been generated from distribution $\mathcal{D}_t$, its weight $\alpha_t$ should minimise the exponential loss: only by giving a smaller weight to a base classifier that errs often and a larger weight to one that is usually correct can the combined judgement become more accurate and the loss be minimised:

$$\ell_{\exp}(\alpha_t h_t \mid \mathcal{D}_t) = e^{-\alpha_t}\,(1 - \epsilon_t) + e^{\alpha_t}\,\epsilon_t$$
Here $\epsilon_t = P_{x \sim \mathcal{D}_t}\big(h_t(x) \ne f(x)\big)$ is the misclassification rate. To obtain the weight $\alpha_t$ of the base classifier, differentiate:

$$\frac{\partial\, \ell_{\exp}(\alpha_t h_t \mid \mathcal{D}_t)}{\partial \alpha_t} = -e^{-\alpha_t}(1-\epsilon_t) + e^{\alpha_t}\epsilon_t$$
Setting the derivative to 0 gives:

$$\alpha_t = \frac{1}{2} \ln \frac{1-\epsilon_t}{\epsilon_t}$$
This completes the "adaptive" part. AdaBoost's adaptivity uses weighted majority voting: the formula above shows that a weak classifier with a small error rate receives a larger weight and thus plays a greater role in the vote, and the opposite holds as the error rate grows.
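To make the monotone relation concrete, here is a quick numeric check of the $\alpha_t$ formula (error rates below 0.5 give positive weights that grow as the error shrinks, and exactly 0.5 gives weight zero):

```python
from math import log

def alpha(eps):
    """Classifier weight alpha_t = 0.5 * ln((1 - eps) / eps)."""
    return 0.5 * log((1 - eps) / eps)

for eps in (0.1, 0.3, 0.45):
    print(eps, round(alpha(eps), 4))
```

For example, a base classifier with 10% error gets weight about 1.10, while one with 45% error gets only about 0.10.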
Now return to the sample handling in the Boosting principle. When changing the sample weights, i.e. the probability distribution, the intuition we want to realise is to increase the weights of samples misclassified by the previous round's weak classifier and decrease the weights of those classified correctly. Next we derive this update formula:

$$\mathcal{D}_{t+1}(x) = \frac{\mathcal{D}_t(x)\, e^{-f(x)\alpha_t h_t(x)}}{Z_t}$$

where $Z_t$ is a normalisation factor.
We ask what sample distribution allows the next base learner to minimise the classification error. After $H_{t-1}$ has been obtained, the sample distribution is adjusted so that the next base learner $h_t$ can learn what the previous learners could not and correct some of their mistakes, i.e. $h_t$ should minimise:

$$\ell_{\exp}(H_{t-1} + h_t \mid \mathcal{D}) = \mathbb{E}_{x \sim \mathcal{D}}\left[e^{-f(x)\left(H_{t-1}(x) + h_t(x)\right)}\right]$$
Noting that $f(x)^2 = h_t(x)^2 = 1$, a Taylor expansion of $e^{-f(x)h_t(x)}$ approximates the above as:

$$\ell_{\exp}(H_{t-1} + h_t \mid \mathcal{D}) \simeq \mathbb{E}_{x \sim \mathcal{D}}\left[e^{-f(x)H_{t-1}(x)}\left(1 - f(x)h_t(x) + \frac{1}{2}\right)\right]$$
So the ideal base learner is:

$$h_t(x) = \underset{h}{\arg\min}\; \ell_{\exp}(H_{t-1} + h \mid \mathcal{D}) = \underset{h}{\arg\max}\; \mathbb{E}_{x \sim \mathcal{D}}\left[f(x)h(x)\, e^{-f(x)H_{t-1}(x)}\right]$$
Note that $\mathbb{E}_{x \sim \mathcal{D}}\left[e^{-f(x)H_{t-1}(x)}\right]$ is a constant. Let $\mathcal{D}_t$ denote the distribution:

$$\mathcal{D}_t(x) = \frac{\mathcal{D}(x)\, e^{-f(x)H_{t-1}(x)}}{\mathbb{E}_{x \sim \mathcal{D}}\left[e^{-f(x)H_{t-1}(x)}\right]}$$
By the definition of mathematical expectation, this is equivalent to making:

$$h_t(x) = \underset{h}{\arg\max}\; \mathbb{E}_{x \sim \mathcal{D}_t}\left[f(x)h(x)\right]$$
Since $f(x), h(x) \in \{-1, +1\}$, we have:

$$f(x)h(x) = 1 - 2\, \mathbb{I}\big(f(x) \ne h(x)\big)$$
So the ideal base learner is:

$$h_t(x) = \underset{h}{\arg\min}\; \mathbb{E}_{x \sim \mathcal{D}_t}\left[\mathbb{I}\big(f(x) \ne h(x)\big)\right]$$
Therefore the ideal $h_t$ minimises the classification error under distribution $\mathcal{D}_t$. The relationship between $\mathcal{D}_t$ and $\mathcal{D}_{t+1}$ is:

$$\mathcal{D}_{t+1}(x) = \frac{\mathcal{D}(x)\, e^{-f(x)H_t(x)}}{\mathbb{E}_{x \sim \mathcal{D}}\left[e^{-f(x)H_t(x)}\right]} = \mathcal{D}_t(x)\, e^{-f(x)\alpha_t h_t(x)}\, \frac{\mathbb{E}_{x \sim \mathcal{D}}\left[e^{-f(x)H_{t-1}(x)}\right]}{\mathbb{E}_{x \sim \mathcal{D}}\left[e^{-f(x)H_t(x)}\right]}$$
The formula above is the sample-distribution update in step 7 of the AdaBoost pseudocode; the whole algorithm is shown in the figure below:
Line 5 of the AdaBoost pseudocode checks whether the current base classifier is better than random guessing; once this condition fails, the current base learner is discarded and the learning process stops. Under this requirement the ensemble may end up with too few base learners and perform poorly overall. Resampling is used to deal with this: in each round of learning, the training set is resampled according to the current sample distribution, a base learner is trained on the resampled set, and the learning process can then restart.
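Putting the derivation together, here is a minimal AdaBoost sketch with one-dimensional threshold stumps as base learners; the toy data set and the exhaustive stump search are illustrative assumptions of mine, not code from the book. The line-5 check and the step-7 reweighting appear as comments:

```python
import numpy as np

def stump_predict(X, thresh, sign):
    """Threshold stump: sign * (+1 if x > thresh else -1)."""
    return sign * np.where(X > thresh, 1.0, -1.0)

def fit_stump(X, y, w):
    """Exhaustively pick the (threshold, sign) pair with minimal weighted error."""
    best = (np.inf, 0.0, 1)
    for thresh in np.unique(X):
        for sign in (1, -1):
            err = np.sum(w * (stump_predict(X, thresh, sign) != y))
            if err < best[0]:
                best = (err, thresh, sign)
    return best

def adaboost(X, y, T):
    m = len(X)
    w = np.full(m, 1.0 / m)              # D_1: uniform sample weights
    learners = []
    for _ in range(T):
        err, thresh, sign = fit_stump(X, y, w)
        if err >= 0.5:                   # line-5 check: no better than guessing
            break
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
        pred = stump_predict(X, thresh, sign)
        w *= np.exp(-alpha * y * pred)   # step-7 sample reweighting
        w /= w.sum()                     # normalisation factor Z_t
        learners.append((alpha, thresh, sign))
    return learners

def predict(learners, X):
    H = sum(a * stump_predict(X, t, s) for a, t, s in learners)
    return np.sign(H)

# Toy 1-D data set; three rounds of boosting classify it perfectly.
X = np.arange(10, dtype=float)
y = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1, -1], dtype=float)
model = adaboost(X, y, T=3)
print((predict(model, X) == y).mean())   # training accuracy
```

No single stump classifies this set correctly, yet the weighted vote of three stumps does, which is exactly the "good but different" effect the derivation promises.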
Bagging is a famous representative of parallel ensemble learning methods and is built on bootstrap sampling: given a data set containing $m$ samples, draw one sample at random, put it back, and repeat; after $m$ draws we obtain a bootstrap sample set of $m$ samples, in which about 63.2% of the initial training samples appear.
Sampling in this way yields $T$ sample sets of $m$ training samples each; a base learner is trained on each sample set, and these base learners are then combined. When predicting, Bagging usually uses simple voting for classification tasks and simple averaging for regression tasks.
The figure above shows the sample distribution generated by bootstrap sampling.
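The 63.2% figure comes from the chance that a given sample is never drawn in m draws with replacement: (1 - 1/m)^m tends to 1/e, so about 1 - 1/e of the samples appear. A quick check (NumPy is used only for the simulated draw):

```python
import numpy as np

m = 10000
# Expected fraction of distinct original samples in one bootstrap sample:
expected = 1 - (1 - 1 / m) ** m           # close to 1 - 1/e

rng = np.random.default_rng(0)
boot = rng.integers(0, m, size=m)         # sample m indices with replacement
observed = len(np.unique(boot)) / m       # fraction of distinct indices drawn
print(expected, observed)
```

The remaining ~36.8% of samples never drawn form the "out-of-bag" data, which Bagging can use for validation without a separate test split.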
Input attribute perturbation usually extracts several attribute subsets from the initial attribute set and then trains different individual learners on different subsets. For example:
Random forest (RF) builds on a Bagging ensemble of decision-tree base learners and further introduces random attribute selection into the training of the trees. A traditional decision tree chooses an optimal attribute from the full attribute set of the current node; in RF, for each node of a base decision tree, a subset of $k$ attributes is first randomly selected from that node's attribute set, and an optimal attribute is then chosen from this subset for the split (a commonly recommended value is $k = \log_2 d$ for $d$ attributes).
The diversity of the base learners in a random forest therefore comes not only from sample perturbation but also from attribute perturbation, so the generalisation performance of the final ensemble is further improved by the increased differences between individual learners.
However, input attribute perturbation is only effective for data sets with many redundant attributes; it is not applicable if the data set contains only a few attributes or few redundant ones. Because of the attribute perturbation, a random forest initially performs slightly worse than Bagging, but as the number of individual learners increases, it usually converges to a lower generalisation error.
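The per-node attribute sampling can be sketched as follows; the scoring function and attribute list are made-up stand-ins for this example, with k defaulting to floor(log2(d)) as recommended:

```python
import math
import random

def choose_split_attribute(attributes, score, k=None):
    """Random-forest style: score only a random subset of k attributes."""
    d = len(attributes)
    if k is None:
        k = max(1, int(math.log2(d)))   # recommended subset size k = log2(d)
    subset = random.sample(attributes, k)
    return max(subset, key=score)       # best attribute within the subset only

random.seed(0)
attrs = list(range(16))                  # d = 16 attributes, so default k = 4
gain = {a: a / 16 for a in attrs}        # toy "information gain" per attribute
best = choose_split_attribute(attrs, gain.get)
print(best)
```

With k = d this degenerates to an ordinary decision-tree split; shrinking k trades per-tree accuracy for diversity across the forest.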
Algorithm parameter perturbation means training learners with large individual differences by randomly setting different parameters, for example the numbers of hidden neurons and the initial connection weights of the neural networks shown in the figure below.
This method is effective for algorithms with many parameters. For algorithms with few parameters, some steps of the learning process can instead be replaced with other, similar procedures to achieve the same perturbation effect. This might even be an idea for a paper; since I probably will not use this kind of algorithm, I will not pursue the research further.
Output label perturbation changes the class labels of the training samples slightly, randomly transforming the original multi-class problem into several binary classification problems on which base learners are trained. A classic algorithm is error-correcting output codes (ECOC).
Each class corresponds to a binary bit string of length $n$ (called a code word); together the $m$ classes form $m$ code words, and the same bit position across these code words defines one binary function. After learning, $n$ binary classifiers are obtained. In the classification stage, the outputs of the binary classifiers on an input sample form an output vector, and a decision rule (such as minimum Hamming distance) then determines the class of the input sample.
This method is effective for data sets with enough categories, but not for data sets with fewer categories.
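A tiny illustration of ECOC decoding; the 4-class, n = 5 code matrix below is invented for this example (real code matrices are designed for large Hamming separation between code words):

```python
# Hypothetical code matrix: 4 classes, n = 5 binary classifiers.
# Row c is the code word for class c; column j defines binary problem j.
CODE = [
    [1, 1, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [1, 0, 0, 1, 0],
    [0, 1, 1, 0, 1],
]

def decode(outputs):
    """Return the class whose code word is nearest in Hamming distance."""
    def hamming(word):
        return sum(a != b for a, b in zip(word, outputs))
    return min(range(len(CODE)), key=lambda c: hamming(CODE[c]))

# Even with one of the five classifiers wrong, decoding recovers class 2:
outputs = [1, 0, 0, 1, 1]   # class 2's word [1, 0, 0, 1, 0], last bit flipped
print(decode(outputs))
```

The redundancy in the code words is what gives ECOC its error-correcting behaviour: a single misfiring binary classifier need not change the decoded class.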
Mixed perturbation applies several of the above perturbation methods simultaneously within the same ensemble algorithm. For example, random forest uses both training-sample perturbation and input-attribute perturbation.
The five points above discussed how to produce good and different individual learners. Once such excellent and different learners have been trained, how should they be combined? The main strategies are averaging methods and voting methods, including:

Simple averaging: average the output results directly.

Weighted averaging: multiply each output by a weight coefficient and sum.

Absolute majority voting: a label is predicted only if it receives more than half of the votes; otherwise the prediction is rejected.

Relative majority (plurality) voting: predict the label with the most votes; if several labels tie for the most votes, choose one of them at random.

Weighted voting: assign a weight to the class label predicted by each individual learner and predict the label with the greatest total weight; the weight is usually the individual learner's classification confidence (class membership probability).
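These combination strategies can be sketched directly in a few lines (the learner outputs below are toy values of my own):

```python
from collections import Counter

def simple_average(outputs):
    """Average regression outputs directly."""
    return sum(outputs) / len(outputs)

def weighted_average(outputs, weights):
    """Weighted sum of outputs (weights usually sum to 1)."""
    return sum(w * o for w, o in zip(weights, outputs))

def absolute_majority(labels):
    """Return the label with more than half the votes, or None (reject)."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None

def plurality(labels):
    """Label with the most votes (ties broken arbitrarily here)."""
    return Counter(labels).most_common(1)[0][0]

def weighted_vote(labels, weights):
    """Label with the greatest total weight across learners."""
    score = Counter()
    for lbl, w in zip(labels, weights):
        score[lbl] += w
    return score.most_common(1)[0][0]

print(simple_average([0.2, 0.4, 0.9]))                # regression combine
print(absolute_majority(["A", "A", "B"]))             # "A" has 2/3 of the votes
print(weighted_vote(["A", "B", "B"], [0.9, 0.3, 0.3]))
```

Note how a single confident learner can outvote two weak ones under weighted voting: "A" wins with weight 0.9 against 0.6 for "B".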