KNN data filling algorithm
KNN (k-nearest neighbors) uses the correlation of data across different dimensions to fill in missing values and correct abnormal values in the data.

The data set discussed in this paper records how air pollutant concentrations measured at various stations in a certain area change over time; data are missing at some stations or at some moments. In this data set the concentration values are correlated in both space and time: the closer two measurements are in spatial distance and in time, the more strongly their values are related. The KNN algorithm can therefore be applied over three dimensions: longitude, latitude and time.

In the figure above, the measured value of the target point at a certain moment is missing, but some measured values around it are available, so we can use the existing data to estimate the target value c_x as a weighted combination of those neighboring values.
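One common way to write such an estimate is a normalized weighted average over the k nearest neighbors. This is a sketch under assumptions: the symbols k, c_i (the measured value at neighbor i) and w_i (its weight) are introduced here for illustration, and the normalization by the sum of the weights is an assumed convention rather than something the text states.

```latex
c_x \approx \frac{\sum_{i=1}^{k} w_i \, c_i}{\sum_{i=1}^{k} w_i}
```

With this form, the estimate always lies between the smallest and largest neighboring values, and equal weights reduce it to a plain mean.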

The weight is inversely proportional to the distance between the neighboring point and the target point; for example, it can be taken as the reciprocal of that distance.

In practice, the exact relationship between weight and distance can be chosen to suit the application.
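The weighting described above can be sketched in a few lines of NumPy. The function names, the `power` parameter and the `eps` guard against zero distances are all hypothetical choices made for this illustration; `power=1.0` gives the reciprocal-distance weights, and larger values make the weight fall off faster.

```python
import numpy as np

def inverse_distance_weights(distances, power=1.0, eps=1e-12):
    """Return weights 1 / d**power; eps (an assumed guard) avoids division by zero."""
    d = np.asarray(distances, dtype=float)
    return 1.0 / np.maximum(d, eps) ** power

def weighted_estimate(values, distances, power=1.0):
    """Normalized weighted average of neighbor values, weighted by inverse distance."""
    v = np.asarray(values, dtype=float)
    w = inverse_distance_weights(distances, power)
    return float(np.sum(w * v) / np.sum(w))

# Two neighbors with values 10 and 20 at distances 1 and 3:
# the nearer neighbor pulls the estimate toward 10 (about 12.5).
estimate = weighted_estimate([10.0, 20.0], [1.0, 3.0])
```

A neighbor at distance zero receives a very large weight, so the estimate is then essentially that neighbor's value.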

To fill data with the KNN algorithm, we need to find the nearest neighbors of each sample, which requires first computing the distances between samples. This can be done with NearestNeighbors from sklearn.neighbors.

from sklearn.neighbors import NearestNeighbors

nbrs = NearestNeighbors(n_neighbors=n_neighbors, algorithm='ball_tree').fit(X)

distances, indices = nbrs.kneighbors(X)
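To make the shape of the result concrete, here is a small sketch with made-up two-dimensional points (the array `X` and the choice of two neighbors are assumptions for illustration only). Both returned arrays have one row per query sample and one column per neighbor, sorted by increasing distance.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Four illustrative points; the values are invented for this example.
X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 2.0],
              [5.0, 5.0]])

nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X)
distances, indices = nbrs.kneighbors(X)

# distances[i, j] is the Euclidean distance from sample i to its j-th neighbor;
# indices[i, j] is the row of that neighbor in X.
# Because the query set equals the fitted set, column 0 is each sample itself
# at distance 0, so the first true neighbor is in column 1.
nearest_other = indices[:, 1]
```

When filling missing values, it is usually more convenient to fit on the samples with known values and query only the samples to be filled, which avoids the self-match in column 0.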

After the distance matrix is obtained, the distance between each sample and its nearest neighbors is known, and the corresponding estimate can be calculated with the previous formula. Note that the sample distance here is the Euclidean distance between samples over the chosen dimensions, and the method assumes that in those dimensions the measured values are correlated with distance. For example, we can take the longitude, latitude and measurement time of each sample as the dimensions for the distance calculation, so that points closer to the target in space and in measurement time have a greater influence on its estimated value.
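Putting the steps together, the following is a minimal sketch of the filling procedure, not the paper's actual implementation. The function name `knn_fill`, the use of NaN to mark missing values, the reciprocal-distance weights and the `eps` guard are assumptions. It also assumes the caller has already rescaled longitude, latitude and time to comparable units, since Euclidean distance otherwise mixes degrees with time units.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_fill(coords, values, k=3, eps=1e-12):
    """Fill NaN entries of `values` from the k nearest samples with known values.

    coords: (n, d) array, e.g. columns of scaled longitude, latitude, time.
    values: (n,) array of measurements, with NaN marking missing entries.
    """
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    known = ~np.isnan(values)
    if not known.any():
        raise ValueError("at least one known value is required")
    filled = values.copy()
    if known.all():
        return filled

    # Fit on known samples only, then query the samples to be filled.
    k = min(k, int(known.sum()))
    nbrs = NearestNeighbors(n_neighbors=k, algorithm='ball_tree').fit(coords[known])
    distances, indices = nbrs.kneighbors(coords[~known])

    # Reciprocal-distance weights, then a normalized weighted average per row.
    weights = 1.0 / np.maximum(distances, eps)
    neighbor_values = values[known][indices]
    filled[~known] = (weights * neighbor_values).sum(axis=1) / weights.sum(axis=1)
    return filled

# Illustrative data: three known samples and one missing sample that lies
# midway between the first two, so with k=2 its estimate is their mean.
coords = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.5, 0.0, 0.0]])
values = np.array([10.0, 20.0, 30.0, np.nan])
filled = knn_fill(coords, values, k=2)
```

Known values are left untouched; only the NaN entries are replaced.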