Overview of distance calculation methods
In classification tasks it is often necessary to estimate how similar different samples are, and the usual way to do this is to compute the "distance" between samples. The choice of distance measure matters a great deal and can directly affect the correctness of the classification.

The purpose of this paper is to summarize the commonly used similarity measures.

The Euclidean distance is the easiest distance measure to understand; it comes from the formula for the distance between two points in Euclidean space.

(1) Euclidean distance between two points a(x1, y1) and b(x2, y2) in the two-dimensional plane: d(a, b) = √((x1 − x2)² + (y1 − y2)²)

(2) Euclidean distance between two points a(x1, y1, z1) and b(x2, y2, z2) in three-dimensional space: d(a, b) = √((x1 − x2)² + (y1 − y2)² + (z1 − z2)²)

(3) Euclidean distance between two n-dimensional vectors a(x11, x12, …, x1n) and b(x21, x22, …, x2n): d(a, b) = √( Σ (x1k − x2k)² ), where the sum runs over k = 1, …, n.
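As a small sketch of how these formulas might be computed (assuming NumPy is available; the vectors below are made-up examples):

import numpy as np

a = np.array([1.0, 2.0, 3.0])   # example vector (made-up values)
b = np.array([4.0, 6.0, 3.0])

# Euclidean distance: square root of the sum of squared component differences
d_euclidean = np.sqrt(np.sum((a - b) ** 2))
print(d_euclidean)               # 5.0
print(np.linalg.norm(a - b))     # equivalent built-in form, also 5.0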

You can guess how this distance is calculated from its name. Imagine driving from one intersection in Manhattan to another. Is the driving distance between the two points the straight-line distance? Obviously not, unless you can cut through the buildings. The actual driving distance is the "Manhattan distance", which is where the name comes from; it is also known as the city block distance.

(1) Manhattan distance between two points a(x1, y1) and b(x2, y2) in the two-dimensional plane: d(a, b) = |x1 − x2| + |y1 − y2|

(2) Manhattan distance between two n-dimensional vectors a(x11, x12, …, x1n) and b(x21, x22, …, x2n): d(a, b) = Σ |x1k − x2k|, summed over k = 1, …, n.
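A minimal NumPy sketch of the same computation (the example vectors are made up):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])
# Manhattan (city block) distance: sum of absolute component differences
d_manhattan = np.sum(np.abs(a - b))
print(d_manhattan)   # 7.0 = |1-4| + |2-6| + |3-3|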

Have you ever played chess? The king can move to any of the eight adjacent squares in one step. How many steps does the king need to go from square (x1, y1) to square (x2, y2)? Try it yourself: you will find that the minimum number of steps is always max(|x2 − x1|, |y2 − y1|). This distance measure is called the Chebyshev distance.

(1) Chebyshev distance between two points a(x1, y1) and b(x2, y2) in the two-dimensional plane: d(a, b) = max(|x1 − x2|, |y1 − y2|)

(2) Chebyshev distance between two n-dimensional vectors a(x11, x12, …, x1n) and b(x21, x22, …, x2n): d(a, b) = max over k of |x1k − x2k|, or equivalently d(a, b) = lim (p→∞) ( Σ |x1k − x2k|^p )^(1/p).

Can't see why these two formulas are equivalent? Hint: try bounding the sum above and below and applying the squeeze theorem.
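A small NumPy sketch illustrating both forms (the example vectors, and the use of a large finite p to stand in for p → ∞, are illustrative only):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])
# Chebyshev distance: the largest absolute component difference
d_cheb = np.max(np.abs(a - b))                       # 4.0
# The p-norm form approaches the same value as p grows large
d_approx = np.sum(np.abs(a - b) ** 50) ** (1 / 50)   # ~4.0
print(d_cheb, d_approx)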

The Minkowski distance is not a single distance but a definition of a whole family of distances.

(1) Definition of the Minkowski distance

The Minkowski distance between two n-dimensional variables a(x11, x12, …, x1n) and b(x21, x22, …, x2n) is defined as: d(a, b) = ( Σ |x1k − x2k|^p )^(1/p), summed over k = 1, …, n.

Where p is a variable parameter.

When p = 1, it is the Manhattan distance.

When p = 2, it is the Euclidean distance.

When p → ∞, it is the Chebyshev distance.

Depending on the parameter p, the Minkowski distance can therefore represent a whole class of distances.
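As a sketch, the three special cases can be checked numerically (example vectors only; for vectors, NumPy's norm with ord=np.inf gives the Chebyshev case):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

def minkowski(x, y, p):
    # General Minkowski distance with parameter p
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

print(minkowski(a, b, 1), np.linalg.norm(a - b, ord=1))   # Manhattan: 7.0
print(minkowski(a, b, 2), np.linalg.norm(a - b, ord=2))   # Euclidean: 5.0
print(np.linalg.norm(a - b, ord=np.inf))                  # Chebyshev: 4.0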

(2) Shortcomings of the Minkowski distance

The Minkowski distance, including the Manhattan, Euclidean, and Chebyshev distances as special cases, has obvious shortcomings.

For example, take three samples whose components are (height in cm, weight in kg): a(180, 50), b(190, 50) and c(180, 60). Then the Minkowski distance between a and b (whether the Manhattan, Euclidean, or Chebyshev distance) equals the Minkowski distance between a and c, but is 10 cm of height really equivalent to 10 kg of weight? Measuring the similarity between such samples with the Minkowski distance is therefore very problematic.

Put simply, the Minkowski distance has two main drawbacks:

(1) It treats the scale (i.e., the "units") of every component as the same.

(2) It ignores that the distributions (expectation, variance, etc.) of the components may differ.

(1) Definition of the standardized Euclidean distance

The standardized Euclidean distance is an improvement that addresses the shortcomings of the plain Euclidean distance. The idea is this: since the distributions of the individual components of the data differ, fine! Then let us first "standardize" every component so that they all have equal mean and variance. How are the mean and variance standardized? Reviewing a little statistics: if the sample set X has mean m and standard deviation s, then the "standardized variable" of X is X* = (X − m) / s. The standardized Euclidean distance between two n-dimensional vectors a(x11, x12, …, x1n) and b(x21, x22, …, x2n) is then d(a, b) = √( Σ ((x1k − x2k) / sk)² ), where sk is the standard deviation of the k-th component.

If the reciprocal of the variance is regarded as a weight, this formula can also be viewed as a weighted Euclidean distance.
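A minimal sketch of the standardized Euclidean distance, assuming the per-component standard deviations are estimated from a small made-up sample matrix X (rows are samples):

import numpy as np

X = np.array([[170.0, 60.0],
              [180.0, 50.0],
              [190.0, 70.0]])   # made-up (height cm, weight kg) samples
s = X.std(axis=0, ddof=1)       # standard deviation of each component
a, b = X[0], X[1]
# Divide each component difference by its standard deviation before summing
d_std = np.sqrt(np.sum(((a - b) / s) ** 2))
print(d_std)   # sqrt(2) here, since both components have s = 10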

(1) Definition of the Mahalanobis distance

Given m sample vectors X1, …, Xm with covariance matrix S and mean vector μ, the Mahalanobis distance from a sample vector x to μ is D(x) = √( (x − μ)ᵀ S⁻¹ (x − μ) ); similarly, the Mahalanobis distance between two sample vectors xi and xj is D(xi, xj) = √( (xi − xj)ᵀ S⁻¹ (xi − xj) ).

If the covariance matrix is the identity matrix (i.e., the components are independent and identically distributed), the formula reduces to the Euclidean distance.

If the covariance matrix is a diagonal matrix, the formula becomes a standardized Euclidean distance.

(2) Advantages and disadvantages of the Mahalanobis distance: it is independent of the measurement scale of each dimension and excludes the interference of correlations between variables.
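A short NumPy sketch of the Mahalanobis distance between two samples, with the covariance matrix estimated from a made-up sample set:

import numpy as np

X = np.array([[3.0, 4.0],
              [5.0, 6.0],
              [2.0, 2.0],
              [8.0, 4.0]])          # made-up sample set, rows are samples
S = np.cov(X, rowvar=False)          # covariance matrix of the components
S_inv = np.linalg.inv(S)
diff = X[0] - X[1]
# Mahalanobis distance: sqrt((xi - xj)^T S^{-1} (xi - xj))
d_mahal = np.sqrt(diff @ S_inv @ diff)
print(d_mahal)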

Are you kidding? This isn't geometry class, so where does the cosine of the included angle come from? Relax: in geometry, the cosine of the angle between two vectors measures how much their directions differ, and machine learning borrows this concept to measure the difference between sample vectors.

(1) Cosine of the angle between vector A(x1, y1) and vector B(x2, y2) in two-dimensional space: cosθ = (x1·x2 + y1·y2) / ( √(x1² + y1²) · √(x2² + y2²) )

(2) Cosine of the angle between two n-dimensional sample points a(x11, x12, …, x1n) and b(x21, x22, …, x2n): cosθ = (a · b) / (‖a‖ ‖b‖) = Σ x1k·x2k / ( √(Σ x1k²) · √(Σ x2k²) ), summed over k = 1, …, n.

The cosine of the included angle ranges over [−1, 1]. The larger the cosine, the smaller the angle between the two vectors; the smaller the cosine, the larger the angle. When the two vectors point in the same direction, the cosine takes its maximum value 1; when they point in exactly opposite directions, it takes its minimum value −1.
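A minimal sketch of the cosine of the included angle between two example vectors:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])
# Cosine of the included angle: dot product divided by the product of the norms
cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_theta)   # close to 1 means similar directions, close to -1 opposite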

(1) Definition of the Hamming distance

The Hamming distance between two equal-length strings s1 and s2 is the minimum number of character substitutions required to change one string into the other. For example, the Hamming distance between the strings "1111" and "1001" is 2.

Application: information coding (in order to enhance fault tolerance, the minimum Hamming distance between codes should be as large as possible).
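A small sketch of the Hamming distance between two equal-length strings (plain Python, no libraries needed):

def hamming_distance(s1, s2):
    # The Hamming distance is only defined for strings of equal length
    if len(s1) != len(s2):
        raise ValueError("strings must have equal length")
    # Count the positions where the characters differ
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(hamming_distance("1111", "1001"))   # 2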

(1) Jaccard similarity coefficient

The proportion of the elements in the intersection of two sets A and B among the elements of their union is called the Jaccard similarity coefficient of the two sets, denoted J(A, B): J(A, B) = |A ∩ B| / |A ∪ B|.

The Jaccard similarity coefficient is an index of the similarity between two sets.

(2) Jaccard distance

The concept opposite to the Jaccard similarity coefficient is the Jaccard distance, which can be expressed as: Jδ(A, B) = 1 − J(A, B) = (|A ∪ B| − |A ∩ B|) / |A ∪ B|.

The Jaccard distance measures how distinguishable two sets are by the proportion of differing elements among all elements of the two sets.

(3) Applications of the Jaccard similarity coefficient and Jaccard distance

The Jaccard similarity coefficient can be used to measure the similarity of samples.

Samples A and B are two n-dimensional vectors whose components are all 0 or 1, for example A(0, 1, 1) and B(1, 0, 1). We regard each sample as a set: 1 means the set contains the corresponding element, and 0 means it does not. Define:

p: the number of dimensions where both sample A and sample B are 1.

q: the number of dimensions where sample A is 1 and sample B is 0.

r: the number of dimensions where sample A is 0 and sample B is 1.

s: the number of dimensions where both sample A and sample B are 0.

Then the Jaccard similarity coefficient of samples A and B can be expressed as: J = p / (p + q + r).

Here p + q + r can be understood as the number of elements in the union of A and B, and p as the number of elements in their intersection.

The Jaccard distance between samples A and B is then: Jδ = (q + r) / (p + q + r).
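A small sketch computing the Jaccard similarity coefficient and distance for two binary vectors via the p, q, r counts above (the example vectors are made up):

import numpy as np

a = np.array([0, 1, 1, 0, 1])
b = np.array([1, 0, 1, 0, 1])
p = np.sum((a == 1) & (b == 1))   # both 1
q = np.sum((a == 1) & (b == 0))   # A is 1, B is 0
r = np.sum((a == 0) & (b == 1))   # A is 0, B is 1
jaccard_similarity = p / (p + q + r)
jaccard_distance = (q + r) / (p + q + r)
print(jaccard_similarity, jaccard_distance)   # 0.5 0.5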

(1) Definition of the correlation coefficient

The correlation coefficient measures the degree of correlation between random variables X and Y: ρ(X, Y) = Cov(X, Y) / (σX · σY) = E[(X − EX)(Y − EY)] / (σX · σY). Its range is [−1, 1]. The larger the absolute value of the correlation coefficient, the more strongly X and Y are correlated. When X and Y are perfectly linearly related, the correlation coefficient is 1 (positive linear correlation) or −1 (negative linear correlation).

(2) Definition of the correlation distance

The correlation distance is defined as D(X, Y) = 1 − ρ(X, Y).
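A minimal sketch of the correlation coefficient and correlation distance between two made-up variables:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])   # roughly linear in x (made-up data)
rho = np.corrcoef(x, y)[0, 1]              # Pearson correlation coefficient
d_corr = 1 - rho                           # correlation distance
print(rho, d_corr)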

Information entropy is not, strictly speaking, a similarity measure. Then why include it in this article? Well... no particular reason, honestly. (╯▽╰)

Information entropy measures how disordered or dispersed a distribution is: the more dispersed (uniform) the distribution, the greater the entropy; the more ordered (concentrated) the distribution, the smaller the entropy. For a sample set X it is defined as H(X) = −Σ pi · log2(pi), summed over i = 1, …, n.

Meaning of parameters:

n: the number of classes in the sample set X.

pi: the probability of class i elements within X.

The greater the information entropy, the more dispersed the classes of the sample set X; the smaller the entropy, the more concentrated its classes. When the n classes in X are equally likely (each with probability 1/n), the information entropy attains its maximum value log2(n); when X contains only one class, the entropy attains its minimum value 0.
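A short sketch of computing the information entropy from class probabilities (the probabilities below are made up):

import numpy as np

def entropy(probabilities):
    p = np.asarray(probabilities, dtype=float)
    p = p[p > 0]                      # skip zero-probability classes (0 * log 0 = 0)
    return -np.sum(p * np.log2(p))

print(entropy([0.25, 0.25, 0.25, 0.25]))   # maximum for 4 classes: log2(4) = 2.0
print(entropy([1.0]))                       # a single class: 0.0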