The authors present this as a domain-independent, high-accuracy, and high-efficiency time series clustering method.
In my view it clusters better than traditional methods. Compared with DTW, its accuracy is slightly worse, but it is much faster; after all, in principle K-Shape only accounts for vertical (amplitude) scaling and horizontal translation, while DTW additionally accounts for horizontal stretching.
The principle of K-Shape is similar to K-means, but it improves the distance measure and optimizes the centroid computation. On the one hand it is invariant to amplitude scaling and translation; on the other hand it is computationally efficient and needs no manually set parameters, which makes it easy to extend to more domains.
A distance measure quantifies the difference between two time series, and its core problem is how to handle deformations of the series. The ECG data shown in Figure-1 of the paper fall into two classes, A and B:
Class A is characterized by rise -> fall -> rise, and class B by fall -> rise. The lower-right plot of Figure-1 shows the ideal modeling result: it recognizes the shared pattern while ignoring differences in amplitude and phase. This is the kind of distance people prefer, and it is often argued that the distance measure matters even more than the clustering method itself. In general, however, methods that support amplitude-scaling and translation invariance are computationally expensive and hard to apply to large datasets.
The mainstream distance measures that predate K-Shape are listed below:
K-Shape measures the distance between two time series by cross-correlation. Suppose X and Y are two series, both of length m. To achieve translation (shift) invariance, keep Y fixed and slide X one step at a time, computing the similarity between X and Y at each shift.
As shown in the figure above, let the green area be Y and the white area be X; each row moves X forward by one step. With sequence length m = 4, the shift s ranges over (-3, 3), i.e. 7 values; the position index w runs over all 2m - 1 = 7 possible placements, and the shift satisfies s = w - m.
Define the cross-correlation coefficient CC:
At each shift, the similarity between X and Y is given by R: take the dot product over the overlapping region (the positions where both X and Y are present), so R is the sum of those pairwise products over the valid overlap. The larger R is, the more similar the two sequences are at that shift.
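As a minimal sketch of this step (the function name and variables are my own, not from the paper), the unnormalized cross-correlation at a single shift is just the dot product over the overlapping region:

```python
import numpy as np

def cc_at_shift(x, y, s):
    """Unnormalized cross-correlation R at shift s: the dot product of the
    overlapping region when x is slid s steps relative to y."""
    m = len(x)
    if s >= 0:
        return float(np.dot(x[:m - s], y[s:]))  # x shifted right by s over y
    return float(np.dot(x[-s:], y[:m + s]))     # x shifted left by -s over y

x = np.array([1., 2., 3., 4.])
y = np.array([1., 2., 3., 4.])
# all 2m-1 = 7 shifts, s from -3 to 3; the overlap (and R) peaks at s = 0
print([cc_at_shift(x, y, s) for s in range(-3, 4)])
# [4.0, 11.0, 20.0, 30.0, 20.0, 11.0, 4.0]
```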
Because the subsequences differ in amplitude and in the number of overlapping points, three normalization schemes are considered; the third, coefficient normalization (NCCc, which divides by the product of the two sequences' norms), gives the best results.
The normalization effect is shown in the following figure:
In Figure (a), only z-normalization is applied to normalize the amplitude, with no shifting; the sequences are already aligned (in phase), so this serves as the reference. From (b), (c) and (d) we can see that (d), which uses the third normalization, attains its largest similarity at the midpoint (s = 0), i.e. when the sequences are exactly aligned, consistent with the effect shown in (a). In contrast, (b) and (c) both place the maximum similarity to the right of center, which is clearly wrong.
On this basis the paper defines SBD (shape-based distance). The better the shapes line up, the larger the cross-correlation CC. Comparing the similarity values over all possible shifts, take the maximum, max(NCCc), and define SBD = 1 - max(NCCc); the more similar the shapes, the smaller the SBD. Since the normalized NCCc lies in [-1, 1], SBD lies in [0, 2].
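A compact sketch of SBD in NumPy (the function name is my own; `np.correlate` with mode `"full"` enumerates all 2m-1 shifts):

```python
import numpy as np

def sbd(x, y):
    """Shape-based distance: 1 minus the max of the coefficient-normalized
    cross-correlation (NCCc). Lies in [0, 2]; 0 means identical shape."""
    ncc = np.correlate(x, y, "full") / (np.linalg.norm(x) * np.linalg.norm(y))
    return 1.0 - float(ncc.max())

a = np.array([0., 1., 2., 1., 0., 0.])
b = np.array([0., 0., 0., 2., 4., 2.])  # same shape as a: shifted and rescaled
print(sbd(a, b))  # ~0.0: amplitude scaling and translation do not affect it
```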
The naive computation above has high time complexity, and the cost grows quickly with sequence length. To address this, the authors transform the sequences from the time domain to the frequency domain via the Fourier transform and compute the cross-correlation there, reducing the cost from O(m^2) to O(m log m).
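The frequency-domain trick can be sketched as follows (padding to the next power of two; the helper name `ncc_c_fft` is my own). It returns the same 2m-1 NCCc values as the naive approach, just faster:

```python
import numpy as np

def ncc_c_fft(x, y):
    """Coefficient-normalized cross-correlation over all 2m-1 shifts,
    computed via FFT in O(m log m) instead of O(m^2)."""
    m = len(x)
    n = 1 << (2 * m - 2).bit_length()  # next power of two >= 2m - 1
    cc = np.fft.irfft(np.fft.rfft(x, n) * np.conj(np.fft.rfft(y, n)), n)
    cc = np.concatenate((cc[-(m - 1):], cc[:m]))  # reorder to shifts -(m-1)..m-1
    return cc / (np.linalg.norm(x) * np.linalg.norm(y))

rng = np.random.default_rng(0)
x, y = rng.standard_normal(500), rng.standard_normal(500)
naive = np.correlate(x, y, "full") / (np.linalg.norm(x) * np.linalg.norm(y))
print(np.allclose(ncc_c_fft(x, y), naive))  # True: same values as the naive O(m^2) loop
```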
After defining the distance, the centroid computation must be adapted to match this distance logic.
As Figure-4 shows, the centroid of time series data is itself a time series. The blue line in the figure computes the centroid by simple averaging (the mean at each time point); because the series are misaligned, peaks and valleys cancel out and the curve is flattened toward a straight line, so it fails to express the shape trend correctly.
K-Shape computes the centroid using an SBD-based method.
The goal of this formula is to find μk*: the centroid μk that maximizes the (squared) normalized cross-correlation NCCc with every sequence xi in cluster Pk.
Algorithm 1: first, the SBD() function computes dist and Y', where dist is the distance between time series X and Y, and Y' is the copy of Y shifted to its best-matching alignment with X. This alignment prevents misaligned peaks and valleys from canceling each other out when sequences are later averaged.
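Algorithm 1 can be sketched like this (the helper name `sbd_align` is mine; it uses the naive `np.correlate` computation for clarity):

```python
import numpy as np

def sbd_align(x, y):
    """Sketch of Algorithm 1: returns (dist, y_aligned), where y_aligned is
    y shifted to the phase that best matches x, zero-padded at the ends."""
    m = len(x)
    ncc = np.correlate(x, y, "full") / (np.linalg.norm(x) * np.linalg.norm(y))
    idx = int(ncc.argmax())
    s = idx - (m - 1)  # optimal shift of y relative to x
    if s >= 0:
        y_aligned = np.concatenate((np.zeros(s), y[:m - s]))
    else:
        y_aligned = np.concatenate((y[-s:], np.zeros(-s)))
    return 1.0 - float(ncc[idx]), y_aligned

x = np.array([1., 2., 1., 0., 0.])
dist, y2 = sbd_align(x, np.array([0., 0., 1., 2., 1.]))
print(dist, y2)  # dist ~0.0; y2 is shifted so its peak lines up with x's peak
```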
Formula 13 is then expanded into Formula 15 using linear algebra:
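A sketch of that expansion, writing the two formulas side by side (the notation is reconstructed from my reading of the paper, so treat the exact symbols as an assumption; the xi here are the aligned sequences):

```latex
\mu_k^{*}
  = \arg\max_{\mu_k} \sum_{x_i \in P_k}
    \left( \frac{x_i^{\top}\mu_k}{\lVert x_i \rVert\,\lVert \mu_k \rVert} \right)^{2}
\;\;\Longrightarrow\;\;
\mu_k^{*}
  = \arg\max_{\mu_k} \frac{\mu_k^{\top} M \mu_k}{\mu_k^{\top}\mu_k},
\qquad
M = Q^{\top} S\, Q,\quad
S = \sum_{x_i \in P_k} x_i x_i^{\top},\quad
Q = I - \tfrac{1}{m}\,O
```

Dropping the constant factors ‖xi‖ leaves exactly a Rayleigh quotient in μk, with S accumulating the outer products of the aligned sequences, Q a centering matrix, and O the m×m all-ones matrix.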
Finally, it can be simplified by Rayleigh quotient formula:
An important property of the Rayleigh quotient R(M, x) is that its maximum equals the largest eigenvalue of matrix M and its minimum equals the smallest eigenvalue, with the maximizer being the corresponding eigenvector. So there is no need to reason directly about x in R(M, x) (i.e. μk in this problem): Formula 13 reduces to the following algorithm:
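This property is easy to check numerically (a small sketch, not part of the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
M = A + A.T  # any symmetric matrix

def rayleigh(x):
    # Rayleigh quotient R(M, x) = x^T M x / x^T x
    return float(x @ M @ x / (x @ x))

vals, vecs = np.linalg.eigh(M)  # eigenvalues in ascending order
print(np.isclose(rayleigh(vecs[:, -1]), vals[-1]))  # True: max of R = largest eigenvalue
print(np.isclose(rayleigh(vecs[:, 0]), vals[0]))    # True: min of R = smallest eigenvalue
```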
Algorithm 2: ShapeExtraction() computes a more reasonable centroid C' from the cluster's current centroid C and all member sequences X.
Line 2: iterate over every sequence X(i) in the cluster.
Line 3: compute dist, the distance between the sequence and the centroid, along with the copy of the sequence aligned to best match the centroid.
Line 4: append the aligned copy to the matrix X'.
Line 5: multiply the transpose of X' by X' itself, producing the square matrix S = X'^T X'.
Line 6: build the matrix Q used for centering (normalization).
Line 7: form the normalized matrix M = Q^T S Q.
Line 8: take the eigenvector of M corresponding to its largest eigenvalue; this extracts the dominant shape of X' as the new centroid.
(The above explanation is personal understanding, not necessarily correct, for reference only)
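Putting lines 2-8 together, a sketch of ShapeExtraction() (helper names are mine; the zero-centroid guard and the final sign fix are implementation details common in ports of the algorithm, not spelled out above):

```python
import numpy as np

def align_to(c, x):
    """Shift x to the phase that best matches centroid c (Algorithm 1's Y')."""
    m = len(c)
    if not c.any():               # first iteration: centroid is still all zeros
        return x.copy()
    s = int(np.correlate(c, x, "full").argmax()) - (m - 1)
    if s >= 0:
        return np.concatenate((np.zeros(s), x[:m - s]))
    return np.concatenate((x[-s:], np.zeros(-s)))

def shape_extraction(X, c):
    """New centroid = top eigenvector of M = Q^T S Q (lines 2-8 above)."""
    m = X.shape[1]
    Xp = np.array([align_to(c, x) for x in X])  # lines 2-4: align members
    S = Xp.T @ Xp                               # line 5: X'^T X'
    Q = np.eye(m) - np.ones((m, m)) / m         # line 6: centering matrix
    M = Q.T @ S @ Q                             # line 7
    vals, vecs = np.linalg.eigh(M)
    mu = vecs[:, -1]                            # line 8: largest eigenvalue's vector
    # eigenvectors are sign-ambiguous: keep the orientation closer to the data
    if np.linalg.norm(Xp[0] - mu) > np.linalg.norm(Xp[0] + mu):
        mu = -mu
    return mu

X = np.array([[1., 2., 1., 0., 0.],
              [0., 1., 2., 1., 0.],
              [0., 0., 1., 2., 1.]])  # the same bump at three phases
mu = shape_extraction(X, X[1])
print(np.argmax(mu))  # 2: the extracted centroid peaks where the aligned bumps peak
```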
The final clustering is realized by iteration, where each iteration has two steps: first recompute the centroids, then reassign every sequence to a cluster according to its distance to the new centroids. Iterate until the labels no longer change.
Algorithm 3: the whole clustering process is implemented by k-Shape();
where X is the set of all sequences, k is the number of clusters, and IDX holds the cluster labels.
Line 3: keep iterating while the labels have not yet stabilized and the number of iterations does not exceed 100.
Line 4-10: recompute the centroid C of each cluster from that cluster's members.
Line 11-17: compute the distance between each sequence and every centroid, and assign the sequence to the nearest cluster (re-labeling).
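The whole loop can be sketched end to end. This is a compact, self-contained sketch that restates the SBD and shape-extraction helpers; initialization, z-normalization of the inputs, and empty-cluster handling are simplified relative to the paper, and all names are mine:

```python
import numpy as np

def sbd(x, y):                        # Algorithm 1: distance + aligned copy of y
    ncc = np.correlate(x, y, "full") / (np.linalg.norm(x) * np.linalg.norm(y))
    s = int(ncc.argmax()) - (len(x) - 1)
    aligned = (np.concatenate((np.zeros(s), y[:len(y) - s])) if s >= 0
               else np.concatenate((y[-s:], np.zeros(-s))))
    return 1.0 - float(ncc.max()), aligned

def shape_extraction(X, c):           # Algorithm 2: centroid via top eigenvector
    m = X.shape[1]
    Xp = np.array([x if not c.any() else sbd(c, x)[1] for x in X])
    Q = np.eye(m) - np.ones((m, m)) / m
    M = Q.T @ (Xp.T @ Xp) @ Q
    mu = np.linalg.eigh(M)[1][:, -1]
    if np.linalg.norm(Xp[0] - mu) > np.linalg.norm(Xp[0] + mu):
        mu = -mu                       # fix the eigenvector's arbitrary sign
    return mu

def kshape(X, k, max_iter=100, seed=0):   # Algorithm 3
    n, m = X.shape
    idx = np.random.default_rng(seed).integers(k, size=n)  # random initial labels
    centroids = np.zeros((k, m))
    for _ in range(max_iter):              # line 3: until stable or 100 iterations
        for j in range(k):                 # lines 4-10: refine each centroid
            if np.any(idx == j):
                centroids[j] = shape_extraction(X[idx == j], centroids[j])
        new_idx = np.array([int(np.argmin(
            [sbd(c, x)[0] if c.any() else np.inf for c in centroids]))
            for x in X])                   # lines 11-17: reassign to nearest centroid
        if np.array_equal(new_idx, idx):
            break
        idx = new_idx
    return idx, centroids

A = np.array([[0., 1., 2., 1., 0., 0., 0., 0.],
              [0., 0., 1., 2., 1., 0., 0., 0.]])  # one shape, two phases
B = np.array([[0., 2., 0., 2., 0., 2., 0., 0.],
              [0., 0., 2., 0., 2., 0., 2., 0.]])  # a different shape, two phases
idx, cents = kshape(np.vstack((A, B)), k=2)
print(idx)  # shifted copies of the same shape always share a label
```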
The time required for each iteration of the K-shape algorithm is:
O(max{n·k·m·log(m), n·m^2, k·m^3})
Where n is the number of sequences, k is the number of clusters, and m is the sequence length. Most of the computational cost depends on the sequence length m; but m is usually much smaller than the number of sequences, so the dependence on m is not the bottleneck. In the rare cases where m is very large, segmentation or dimensionality reduction can effectively shorten the sequences.
Figure-5 compares the results of K-Shape against ED- and DTW-based models: SBD outperforms ED in most cases and even outperforms DTW in some; and where DTW is more accurate, SBD still has the advantage of being much faster.