01 KNN Algorithm: Overview
KNN is short for k-nearest neighbors.

KNN is a basic machine learning algorithm. The "k nearest neighbors" of a sample are simply the k training samples closest to it; the idea is that each sample can be represented by its k nearest neighbors.

KNN is a relatively simple algorithm, simpler than the regression and classification algorithms discussed earlier. For someone who has never studied machine learning, k-nearest neighbors is the most intuitive way to classify data. For example, if you want to know what kind of person I am and you find that my closest friends are all funny people, you can reasonably assume that I am funny too.

The KNN algorithm can be used for both classification and regression.

The main difference between KNN classification and KNN regression lies in the final decision rule: classification generally uses majority voting, while regression generally uses averaging.

Majority voting method: look at the k samples closest to the target sample and assign the target to whichever class appears most often among them.

Average method: to predict, say, a person's height, take the k nearest samples around the target and use the average of their heights as the prediction.
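As a minimal sketch of these two decision rules (the helper names below are ours, not from any library):

from collections import Counter
import numpy as np

def majority_vote(neighbor_labels):
    """Classification: the most common label among the k neighbors wins."""
    return Counter(neighbor_labels).most_common(1)[0][0]

def average_value(neighbor_values):
    """Regression: the prediction is the mean of the k neighbors' values."""
    return float(np.mean(neighbor_values))

print(majority_vote(["funny", "funny", "serious"]))  # 'funny'
print(average_value([170, 175, 180]))                # 175.0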

Here is another example:

Suppose we judge the type of a food from two features: sweetness and crispness.

From the samples, we generally find that:

Foods that are sweet and crisp are fruits.

Foods that are neither sweet nor crisp are proteins.

Foods that are crisp but not sweet are vegetables.

So we can classify a new food according to its sweetness and crispness, as in the sketch below.
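A minimal sketch with scikit-learn; the sweetness/crispness scores below are made up for illustration:

from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training samples: [sweetness, crispness], each scored 1-10.
X = [[9, 8], [8, 9], [2, 3], [3, 2], [2, 8], [3, 9]]
y = ["fruit", "fruit", "protein", "protein", "vegetable", "vegetable"]

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X, y)

# A new food that is sweet and crisp lands among the fruits.
print(clf.predict([[8, 8]]))  # ['fruit']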

Selecting the value of k:

In practice, start with a relatively small k, then choose the final value through cross-validation.

The smaller k is, the fewer samples are used for each prediction; the training error drops, but the model becomes more complex and tends to overfit.

The larger k is, the more samples are used for each prediction; the training error rises, the model becomes simpler, and it tends to underfit.
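A minimal sketch of choosing k by cross-validation, using scikit-learn's built-in iris data as a stand-in dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Score each candidate k with 5-fold cross-validation and keep the best one.
scores = {}
for k in range(1, 31):
    clf = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(clf, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))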

Measurement of distance:

Euclidean distance: the Euclidean metric (also called Euclidean distance) is the most common definition of distance. It is the straight-line distance between two points in m-dimensional space, or equivalently the natural length of a vector (its distance from the origin): d(x, y) = sqrt((x1 - y1)^2 + ... + (xm - ym)^2). In two- and three-dimensional space, it coincides with the ordinary physical distance between two points.
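As a minimal sketch, the same distance computed with NumPy:

import numpy as np

def euclidean_distance(x, y):
    """Straight-line distance between two points in m-dimensional space."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sqrt(np.sum((x - y) ** 2)))

print(euclidean_distance([0, 0], [3, 4]))  # 5.0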

Decision rules:

Classification: majority voting method and weighted majority voting method.

Regression: average method and weighted average method.

Weighted majority voting method: each neighbor's vote is given a weight, typically the inverse of its distance, so that closer neighbors count for more.

Average method and weighted average method:

Consider the example from the figure: of the k = 5 nearest samples, three have value 3 and the other two have value 2. What value do we predict?

Without weighting, take the plain average:

(3 * 3 + 2 * 2) / 5 = 2.6

Weighted average: suppose each of the three samples with value 3 gets weight 1/7, and each of the two (closer) samples with value 2 gets weight 2/7, so the five weights sum to 1. The weighted average is:

3 * 3 * 1/7 + 2 * 2 * 2/7 = 17/7 ≈ 2.43
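The same arithmetic in code; because the weights already sum to 1, no further division is needed. (In scikit-learn, KNeighborsRegressor applies this idea automatically with weights='distance'.)

import numpy as np

values  = np.array([3, 3, 3, 2, 2])
weights = np.array([1/7, 1/7, 1/7, 2/7, 2/7])  # closer samples weigh more

print(values.mean())                                   # 2.6  (plain average)
print(round(np.average(values, weights=weights), 2))   # 2.43 (weighted average)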

1. Brute force (brute):

Compute the distance from the query sample to every sample in the training set, then take the k smallest distances to get the k nearest neighbors.

Disadvantage: when there are many features and many samples, this approach is inefficient.
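A minimal sketch of the brute-force search with NumPy:

import numpy as np

def brute_force_knn(X_train, x_query, k):
    """Distance to every training sample, then the indices of the k smallest."""
    dists = np.sqrt(np.sum((X_train - x_query) ** 2, axis=1))
    return np.argsort(dists)[:k]

X_train = np.array([[1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [5.0, 5.0]])
print(brute_force_knn(X_train, np.array([1.0, 1.0]), k=2))  # [0 2]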

2. KD tree (kd_tree):

First build a KD tree from the training data, then query that tree to retrieve the neighboring samples.

The following sections will walk through how a KD tree searches for the nearest samples, so you can get an intuitive feel for how much less data a KD tree needs to examine compared with the brute-force implementation.
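As a quick preview, a minimal sketch of building and querying a KD tree with scikit-learn (the random points are placeholders):

import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X_train = rng.random((1000, 2))  # 1000 random 2-D training points

# Build the tree once; each query then visits only part of the data.
tree = KDTree(X_train)
dist, ind = tree.query([[0.5, 0.5]], k=3)
print(ind)   # indices of the 3 nearest training samples
print(dist)  # their Euclidean distances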