Based on large-scale trajectory data, the related crowd query function is developed and provided to solve the problems that are difficult to be found by susceptible people. By matching and mining the trajectory, people who have had "contact" with the trajectory of the diagnosed person in time and space can be quickly found out. Among them, one of the most important tasks to realize this function is how to measure the similarity of two trajectories. The following will briefly introduce some common trajectory similarity measurement methods.
Trajectory, as a kind of spatio-temporal data [1], refers to the moving path of an object in space, which is usually expressed as a sequence of GPS points, such as tr=, where point pi=(lat, lng, t) indicates that the object is at the geographic coordinate position (lat, lng) at time t, and lat and lng respectively indicate latitude and longitude.
In the era of big data, with the popularization of car navigation system, a large number of trajectory data are constantly being generated, which contains great value [2]. For example, traffic flow can be analyzed and predicted to provide suggestions for the government's urban planning; Trajectory clustering can also be carried out to find roads crossed by multiple trajectories to guide the planning of bicycle lanes; You can also detect the stagnation point and find the area where the trajectory often stays.
The analysis and processing of trajectory data is very challenging, which mainly includes three aspects: 1) the amount of trajectory data is large; 2) The trajectory data is noisy; 3) There are various ways to obtain trajectory data. Among them, trajectory similarity, as a basic algorithm service, measures the distance between two trajectories, which can support its upper application and is also one of the hot spots in current research.
Compared with point-to-point or point-to-track distance measurement, the distance measurement between tracks is more complicated, and many factors need to be considered, such as the sampling rate of tracks, the time information considering tracks, and the noise of tracks themselves. Common trajectory similarity measurement methods are roughly classified as shown in the following figure.
We define the following two trajectories of length n and m respectively, and then:
Euclid distance Euclid distance requires two trajectories to be equal in length and correspond to each other one by one. Its mathematical definition is:
The definition of Euclidean distance is simple and clear, that is, the average of the spatial distances of the corresponding points of two trajectories, but the disadvantages are also obvious, that is, the similarity of trajectories with different lengths cannot be measured and is sensitive to noise points.
Dynamic time warping (DTW)
As mentioned above, an obvious limitation of Euclidean distance is that two trajectories are required to be equal in length, which is unlikely in practical situations. Ideally, the track length should be flexible.
The idea of dynamic time warping DTW[3] is to automatically warp two sequences, scale and align them locally on the time axis, so that their shapes are as consistent as possible, so as to obtain as great a similarity as possible. DTW maps the points of two trajectories into many-to-many, thus effectively solving the problem of uneven data. Its dynamic programming algorithm is as follows:
Where head(tr)= 1
Dynamic time normalization algorithm is flexible, unlimited and effective, but it does not deal with noise points, and outliers will also have a great impact on the results.
Longest common subsequence (LCSS)
There is a classical algorithm problem: to solve the longest common subsequence of two sequences, it is not required to connect two continuous common subsequences, for example, the largest common subsequence of BDCABA and ABCBDAB is BCBA. On this basis, it is natural to propose a trajectory similarity measurement method based on the longest common * * * subsequence, that is, LCSS, whose value represents the number of points that can be regarded as the same point at most, that is, the logarithm of trajectory points in two trajectories that meet the minimum distance threshold. The algorithm based on dynamic programming is as follows:
Among them, the parameter is the minimum distance threshold, and when the distance between two points is less than this value, it will be considered as the same point. In addition, the algorithm has no limit on the trajectory length.
The longest common substring distance deals with the noise point, that is, because the deviation of the noise point is not close to its tracking point, it will not be included in the final result. This step is very effective for noise. But at the same time, the minimum distance threshold of the algorithm is difficult to define, and it is possible to return dissimilar trajectories.
Edit the distance (EDR) of the real sequence
Given two trajectories tr 1 and tr2 with lengths of n and m respectively, and the matching threshold of the minimum distance, the EDR distance between the two trajectories [4] is the number of operations that need to insert, delete or replace tr 1 and make it tr2. The algorithm based on dynamic programming is as follows.
Among them,
The editing distance of trajectory provides a new idea for the new trajectory similarity measurement, and its defect is also obvious, that is, it is sensitive to noise points.
hausdorff distance
Simply put, Hausdorff distance [5] is the maximum distance between the nearest points of two trajectories.
Where h(tr 1, tr2) is called the one-way Hausdorff distance from tr 1 to tr2, and its definition is as follows:
Fréchet distance (Fréchet distance)
Intuitively speaking, the Fréchet distance [6] is the dog leash distance, that is, the shortest dog leash length required for the owner to take the path A and the dog to take the path B respectively.
The Fréchet distance algorithm based on dynamic programming idea is as follows:
Where d(p, q) is the Euclidean distance between two GPS points, and tr(n- 1)= is the sub-trajectory of the trajectory tr with the length of n- 1.
Fréchet distance provides us with a simple and intuitive method to measure similarity, and it can also achieve good results. Unfortunately, however, it does not deal with noise points. For example, if the dog's tracking point deviates far because of noise, the Fréchet distance will also increase, which is obviously unreasonable.
One-way distance (OWD)
One-way distance OWD[7] is defined as follows:
Where |tr 1| represents the length of the trajectory tr 1, and d(p, tr2) represents the distance from the GPS point p to tr2. For symmetry, just modify the above formula:
Figure 1 1: One-way distance diagram, which is the ratio of the sum of polygon areas to the length of polyline.
The basic concept of OWD distance is based on the area surrounded by two tracks. The large area indicates that the distance between tracks is far and the similarity is low. On the contrary, if the closed area is 0, it means that the two tracks overlap and have the highest similarity.
Position between polylines (LIP)
The multiline position distance LIP[8] is defined as follows:
Where Ii represents the ith intersection of two trajectories, and the weight wi is defined as follows:
LIP method is easy to understand, when the perimeter of a certain area accounts for a large proportion of the total length, the weight is naturally large; When the area is 0, it means that there is no gap between the two tracks and the lip distance is 0; When the weighted sum of the areas is large, it means that the gap between the two tracks is large and the lip distance is also large. In addition, the weight is determined by the ratio of the perimeter of the region to the total length, which also resists the interference of noise points to some extent.
At present, JUST (JD city spatio-temporal data engine), a spatio-temporal data engine developed by JD.COM, has realized many trajectory similarity calculation methods such as dynamic time warping, Hausdorff distance and Fréchet distance. Users only need one SQL to realize it, without coding, but also provide a trajectory visualization function for users to observe. Welcome to visit/experience. Please send an email to just@jd.com for cooperation.
References:
[1] Zheng, Y, et al. Urban computing: concept, method and application. ACM。
[2] JUNG WOO. Trajectory data mining: a survey. Transactions of the American Computer Society on Intelligent Systems and Technologies. 20 15 Volume 6, Issue 3.
Yi B K, Jagadish H V, Faloutsos C. Efficient retrieval of similar time series under time warping [C]//Proceedings 14 The 4th International Conference on Data Engineering. IEEE, 1998: 20 1-208。
[4] Bobby Chen,? Robust Fast Similarity Search of Moving Object Trajectory [C]//Proceedings of 2005 ACM SIGMOD International Conference on Data Management. 2005: 49 1-502.
[5] Alt H. Computational geometry of comparison shapes [M]// Efficient algorithm. Springer, Berlin, Heidelberg, 2009: 235-248.
Eiter T, Mannila H. Calculating Discrete Fréchet Distance [R]. Technical Report CD-TR 94/64, Christian Johann Expert System Laboratory, University of Vienna, Austria, zip code: 1994.
[7] Paper: Lin B, Su J. One-way distance: used for shape-based similarity search of moving object artifacts [J]. Geographic information, 2008,12 (2):17-142.
[8] LIP Paper: Pelekis N, Kopanakis I, Marketos G, et al. Similarity Search in Trajectory Database [C]/14 The 4th International Symposium on Temporal Representation and Reasoning (TIME'07).IEEE, 2007:129-/kloc-0.