One of the defining characteristics of time series data is that it often fluctuates sharply over time, as shown in the following figure:
As the figure shows, the data points are so dense and fluctuate so frequently that once they are connected by line segments they overlap into an unreadable blur, making the visualization ineffective. The raw data points therefore need to be sampled, for example from 10,000 raw points down to 200.
A simple sampling algorithm is to compute a statistic such as the average, maximum, or minimum over groups of points. To sample 10,000 raw points into 200, for example, divide the raw points into 200 groups of 50 points each (200 × 50 = 10,000), then take the average of all the raw points in each group.
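A minimal sketch of this grouping-and-averaging approach (the use of NumPy and the synthetic series here are illustrative assumptions, not part of the original article):

```python
import numpy as np

# Stand-in for a real series: 10,000 raw data points.
raw = np.random.randn(10_000).cumsum()

# 10,000 raw points -> 200 sampled points: 200 buckets of 50 points each.
buckets = raw.reshape(200, 50)           # 200 * 50 == 10,000

mean_sampled = buckets.mean(axis=1)      # one average per bucket
max_sampled = buckets.max(axis=1)        # or keep each bucket's maximum
min_sampled = buckets.min(axis=1)        # or keep each bucket's minimum
```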
This algorithm is very simple, but it has a problem: many details of the original data's fluctuations are lost, as shown in the following figure:
The gray line in the figure is the raw data, and the dark line is the sampled data. The data becomes much smoother after sampling, and many details are lost. In particular, a very prominent peak in the raw data, highlighted by the red box, is erased entirely, even though it likely represents a business anomaly.
The paper "Downsampling Time Series for Visual Representation" describes a sampling algorithm called LTTB (Largest-Triangle-Three-Buckets; several similar algorithms are also covered in the paper) that preserves the fluctuation details of the original data while sampling. The underlying principle is not expanded on here; only the effect of the algorithm is shown.
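For reference, a minimal NumPy sketch of the LTTB idea as described in the paper: keep the first and last points, split the interior points into buckets, and from each bucket pick the point that forms the largest triangle with the previously selected point and the average of the next bucket. The function name `lttb` and all variable names here are assumptions for illustration.

```python
import numpy as np

def lttb(x, y, n_out):
    """Largest-Triangle-Three-Buckets downsampling (sketch)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    if n_out >= n or n_out < 3:
        return x, y

    # Boundaries of the n_out - 2 interior buckets over points 1 .. n-2.
    bounds = np.linspace(1, n - 1, n_out - 1).astype(int)

    selected = [0]          # always keep the first point
    a = 0                   # index of the most recently selected point
    for i in range(n_out - 2):
        lo, hi = bounds[i], bounds[i + 1]
        # Average point of the *next* bucket; for the final bucket,
        # the "next" point is simply the last point of the series.
        if i < n_out - 3:
            avg_x = x[bounds[i + 1]:bounds[i + 2]].mean()
            avg_y = y[bounds[i + 1]:bounds[i + 2]].mean()
        else:
            avg_x, avg_y = x[-1], y[-1]
        # Twice the triangle area for each candidate in the bucket;
        # the constant factor does not change the argmax.
        area = np.abs((x[a] - avg_x) * (y[lo:hi] - y[a])
                      - (x[a] - x[lo:hi]) * (avg_y - y[a]))
        a = lo + int(np.argmax(area))
        selected.append(a)

    selected.append(n - 1)  # always keep the last point
    idx = np.array(selected)
    return x[idx], y[idx]

# Example: downsample 10,000 noisy points to 200 while keeping peaks.
xs = np.arange(10_000)
ys = np.sin(xs / 30) + np.random.randn(10_000) * 0.1
sx, sy = lttb(xs, ys, 200)
```

Because each bucket keeps an actual raw point rather than an average, sharp peaks like the one in the red box above survive the sampling.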
LTTB can be widely applied in monitoring products, where it solves the following two problems well: