Spatial data mining (SDM) is the process of discovering previously unknown, potentially valuable rules hidden in spatial data. Specifically, it draws on theories such as deterministic sets, fuzzy sets and bionics, and on techniques such as artificial intelligence and pattern recognition, to extract credible and potentially useful knowledge and to uncover the hidden laws and relationships behind spatial data sets, thereby providing a theoretical and technical basis for spatial decision-making [1].
1. General steps of spatial data mining
A spatial data mining system can be roughly divided into the following steps:
(1) Spatial data preparation: select the data sources, including map data, image data, terrain data, attribute data, etc.
(2) Spatial data preprocessing and feature extraction: preprocessing removes noise from the data and covers data cleaning, data conversion and data integration. Feature extraction removes redundant or irrelevant features and transforms the remaining ones into new features suitable for data mining.
(3) Spatial data mining and knowledge evaluation: the spatial data are analyzed, processed and used for prediction with spatial data mining techniques in order to find the relationships behind the data. The results are then evaluated against domain knowledge to see whether the expected effect has been achieved.
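As a rough illustration of step (2), the sketch below cleans a small set of records and drops uninformative attributes. The record layout (dictionaries with "x" and "y" coordinate keys that all share the same attributes) and the function names are assumptions made for this example only; real preprocessing depends on the data sources chosen in step (1).

```python
def clean_records(records):
    """Data cleaning: drop records with missing coordinates and exact duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        if rec.get("x") is None or rec.get("y") is None:
            continue  # a record without a location is treated as noise here
        key = tuple(sorted(rec.items()))
        if key in seen:
            continue  # exact duplicate of an earlier record
        seen.add(key)
        cleaned.append(rec)
    return cleaned


def drop_constant_features(records, coord_keys=("x", "y")):
    """Feature reduction: remove attributes that take one value in every record.

    Coordinates are always kept, even if they happen to be constant.
    Assumes every record has the same keys.
    """
    if not records:
        return []
    keep = [
        k for k in records[0]
        if k in coord_keys or len({rec[k] for rec in records}) > 1
    ]
    return [{k: rec[k] for k in keep} for rec in records]


# Hypothetical sample data
raw = [
    {"x": 1.0, "y": 2.0, "landuse": "forest", "source": "survey"},
    {"x": 1.0, "y": 2.0, "landuse": "forest", "source": "survey"},
    {"x": None, "y": 3.0, "landuse": "water", "source": "survey"},
    {"x": 4.0, "y": 5.0, "landuse": "urban", "source": "survey"},
]
prepared = drop_constant_features(clean_records(raw))
```

Here the duplicate and the record without an x coordinate are discarded, and the "source" attribute is removed because it carries no information that distinguishes the remaining records.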
2. Research on spatial data mining methods
Spatial data mining is a comprehensive interdisciplinary subject that combines features of computer science, statistics, geography and other fields, and it has produced a large number of mining methods for spatial data.
2.1 Spatial association rules
Association rule mining finds relationships between data items. A rule is expressed as X→Y, where X and Y are two disjoint sets of data items, that is, X∩Y=∅. Koperski K and others combined association rules with spatial databases and proposed spatial association rule mining [2]. Spatial association rules replace data items with spatial predicates and are usually expressed as follows:
A1∧A2∧…∧An→B1∧B2∧…∧Bm (3)
Let A=(A1, A2, …, An) and B=(B1, B2, …, Bm), where A and B are the sets of predicates Ai and Bj respectively. The predicates in A and B can be spatial or non-spatial, but the rule must contain at least one spatial predicate, and A∩B=∅. Building on the characteristics of spatial association rules, Shekhar S and Huang Y generalized the idea of association rules to spatial co-location rules over spatially indexed point sets, replacing transactions with neighborhoods so that spatial correlation is not violated [3]. Spatio-temporal association involves not only the spatial correlation of events but also their spatial position and time-series factors. Chai Siyue, Su Fenzhen and Zhou Chenghu proposed a method for mining spatio-temporal association rules based on a periodic table [4].
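To make the rule form A→B concrete, the following sketch evaluates one candidate rule over a handful of spatial objects. Support and confidence are the standard association rule measures and are not defined in the text above, so their use here is an assumption; the predicate strings and the function name are likewise hypothetical.

```python
def rule_metrics(objects, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent.

    objects: list of sets, each holding the predicates true for one object.
    support    = share of objects satisfying every predicate in A and B
    confidence = share of objects satisfying A that also satisfy B
    """
    antecedent, consequent = frozenset(antecedent), frozenset(consequent)
    if antecedent & consequent:
        raise ValueError("antecedent and consequent must be disjoint")
    n_a = sum(1 for preds in objects if antecedent <= preds)
    n_ab = sum(1 for preds in objects if (antecedent | consequent) <= preds)
    support = n_ab / len(objects) if objects else 0.0
    confidence = n_ab / n_a if n_a else 0.0
    return support, confidence


# Hypothetical objects described by spatial and non-spatial predicates
towns = [
    {"is_a(town)", "close_to(water)", "adjacent_to(highway)"},
    {"is_a(town)", "close_to(water)"},
    {"is_a(town)", "adjacent_to(highway)"},
    {"is_a(town)", "close_to(water)", "adjacent_to(highway)"},
]
support, confidence = rule_metrics(
    towns,
    antecedent={"is_a(town)", "close_to(water)"},
    consequent={"adjacent_to(highway)"},
)
```

In this toy data three of the four towns are close to water and two of those are also adjacent to a highway, so the rule holds for half of all objects and for two thirds of the objects that satisfy its antecedent.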
2.2 Spatial clustering
Spatial clustering analysis extends ordinary clustering analysis, but methods designed for ordinary data cannot be applied to spatial data unchanged. According to the first law of geography, spatial objects are correlated with one another to some degree, so spatial autocorrelation should be taken into account when clusters are defined in spatial clustering analysis. Autocorrelation analysis of the spatial data shows whether objects are spatially correlated and therefore helps judge whether they can reasonably be placed in the same cluster.
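The text does not name a particular autocorrelation statistic. As one possible illustration, the sketch below computes Moran's I, a widely used global measure, for values observed at n locations with a given spatial weight matrix. The choice of Moran's I, the binary adjacency weights and the sample values are all assumptions of this example.

```python
def morans_i(values, weights):
    """Global Moran's I.

    values:  list of n observations, one per spatial object
    weights: n x n matrix, weights[i][j] > 0 if objects i and j are neighbors
    Values near +1 suggest similar values cluster in space, values near -1
    suggest neighbors tend to differ, values near 0 suggest no clear pattern.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_sum = sum(sum(row) for row in weights)
    cross = sum(
        weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n)
    )
    variance_term = sum(d * d for d in dev)
    return (n / w_sum) * (cross / variance_term)


# Four cells in a row; each cell neighbors the cells next to it
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
smooth = morans_i([1, 2, 3, 4], chain)       # positive: neighbors are similar
alternating = morans_i([1, 4, 1, 4], chain)  # negative: neighbors differ
```

A clearly positive value for a data set supports treating nearby objects as candidates for the same cluster, which is the judgment the paragraph above describes.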
The basic clustering algorithms are:
(1) Partition clustering algorithm: given n data objects and a number of groups k (k≤n), the objects are iteratively reassigned under a partition rule that optimizes a chosen objective until all n objects are assigned to the k groups, so that objects within a group are more similar to each other than to objects in other groups.
(2) Hierarchical clustering algorithm: the data are repeatedly split or merged until they form a hierarchical clustering tree that meets certain criteria.
(3) Density clustering algorithm: the data objects are grouped into several high-density regions that are separated from one another by low-density regions.
(4) Graph clustering algorithm: each data object is represented by a spatial node, and the nodes are grouped into subgraphs according to certain criteria. Together the subgraphs make up a graph containing all spatial objects, and each subgraph represents one spatial cluster.
(5) Grid clustering algorithm: the spatial region is divided into grid cells organized in a multi-resolution grid structure, and the data are clustered on the grid cells.
(6) Model clustering algorithm: a mathematical model is assumed for the data, the model that best fits the data is used for clustering, and each cluster is represented by a probability distribution.
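As an example of the partition approach in item (1), here is a minimal K-means sketch in plain Python. It uses Euclidean distance and takes the first k points as initial centers so that the result is reproducible; both choices are assumptions of this sketch, and practical implementations usually use random or smarter initialization and do not account for spatial autocorrelation or obstacles.

```python
import math


def kmeans(points, k, max_iter=100):
    """Assign each point to one of k groups by iterative reassignment.

    points: list of coordinate tuples, k <= len(points)
    Returns (labels, centers).
    """
    centers = [points[i] for i in range(k)]  # simple deterministic start
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the partition has converged
        labels = new_labels
        # Update step: move each center to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return labels, centers


# Hypothetical sample: two well-separated groups of three points
sample = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(sample, k=2)
```

On this sample the loop stabilizes after a few passes with the three points near the origin in one group and the three points near (10, 10) in the other.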
Generally, a single algorithm on its own cannot achieve satisfactory results. Wang Jiayao, Zhang Xueping and Zhou Haiyan combined the genetic algorithm with the K-means algorithm and proposed a genetic K-means algorithm for spatial clustering analysis [5]. Real spatial environments contain many obstacles, such as roads, bridges and rivers. Zhang Xueping, Yang Tengfei and others combined the K-medoids algorithm with a quantum particle swarm optimization algorithm for clustering analysis under spatial obstacle constraints [6].
2.3 Spatial classification
Put simply, classification is the process of learning a classification model and then assigning data objects to predefined classes according to that model. Spatial classification considers not only the non-spatial attributes of a data object but also the influence of the non-spatial attributes of adjacent objects on its class. It is a supervised analysis method.
Spatial classification methods include statistical methods, machine learning methods and neural network methods. The Bayesian classifier is a statistical method: it uses prior probabilities and Bayes' formula to compute the posterior probability of each class for a data object and assigns the object to the class with the highest posterior probability. The decision tree classifier is a machine learning method that builds the tree with a top-down greedy strategy. Each internal node tests an attribute, and the branches are created by comparing attribute values at that node. Each leaf node corresponds to the objects that satisfy the conditions along its path, so the path from the root node to a leaf node represents a classification rule. The support vector machine is also a machine learning method. Its idea is to map the training data set into a higher-dimensional space with a nonlinear mapping and then find the maximum-margin hyperplane that separates the data objects. A neural network is a network that imitates biological neurons. It consists of a group of interconnected input and output units, with a weight attached to each connection. By adjusting these weights, the network learns to classify data objects correctly.
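The Bayesian decision rule described above can be sketched in a few lines. The class names and probability values below are invented for illustration, and the likelihoods P(observation | class) are simply given rather than estimated from training data, which a real classifier would have to do.

```python
def posterior(priors, likelihoods):
    """Apply Bayes' formula.

    priors:      {class: P(class)}
    likelihoods: {class: P(observation | class)}
    Returns {class: P(class | observation)}.
    """
    joint = {c: priors[c] * likelihoods[c] for c in priors}
    evidence = sum(joint.values())  # P(observation)
    return {c: joint[c] / evidence for c in joint}


def classify(priors, likelihoods):
    """Pick the class with the highest posterior probability."""
    post = posterior(priors, likelihoods)
    return max(post, key=post.get)


# Hypothetical numbers: a pixel with low near-infrared reflectance
priors = {"water": 0.3, "forest": 0.7}
likelihoods = {"water": 0.8, "forest": 0.1}
post = posterior(priors, likelihoods)
label = classify(priors, likelihoods)
```

Although "forest" has the larger prior, the observation is far more likely under "water", so the posterior favors "water" and that class is selected.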
For spatial classification in the presence of spatial autocorrelation, Shekhar S and others used a spatial autoregressive model and a Bayesian Markov random field [7], while Wang Min, Luo and others combined a Gaussian Markov random field with a support vector machine to extract information from remote sensing images [8].
2.4 Other spatial data mining methods
Many other spatial data mining methods exist. Spatial analysis uses GIS methods, techniques and theories to process spatial data and find unknown, useful information patterns. Methods based on fuzzy sets, rough sets and cloud theory can be used to analyze uncertain spatial data. Visualization methods present the spatial data under analysis as images so that hidden information can be read from them. In China, Zhang Zijia, Yue Bangshan and Pan Qi combined the ant colony algorithm with an adaptive-filtering fuzzy clustering algorithm to segment images [9].
3. Conclusion
As an extension of data mining, spatial data mining builds on the solid theoretical basis of traditional data mining methods. Although great progress has been made, its theory and methods still need further in-depth study. With the advent of the big data era and the growing volume of spatial data, improving the accuracy and precision of data mining is a problem that needs to be studied. At the same time, the time complexity of popular spatial data mining algorithms still lies between O(n log n) and O(n³), so their efficiency needs to be improved further. Data mining has been applied successfully in cloud environments [10], and spatial cloud computing for processing spatial data is a research direction for scholars. Most spatial data mining algorithms do not consider obstacles, so how to handle real-world obstacles is worth discussing. Spatial data with a time attribute describe dynamic, changing spatial phenomena, and spatio-temporal data mining will be a focus of future research.
Because data mining involves many disciplines, its basic theories and methods are relatively mature. For spatial data mining, how to apply and extend these theories and methods appropriately will remain a direction that requires long-term effort from researchers.
References
[1] Wang. Theory and Application of Spatial Data Mining (2nd Edition) [M]. Beijing: Science Press, 2013.
[2] Koperski K, Han J. Discovery of spatial association rules in geographic information databases [C]. Proceedings of the 4th International Symposium on Advances in Spatial Databases, 1995: 47-66.
[3] Shekhar S, Huang Y. Discovering spatial co-location patterns: a summary of results [C]. Proceedings of the 7th International Symposium on Advances in Spatial and Temporal Databases, 2001: 236-256.
[4] Chai Siyue, Su Fenzhen, Zhou Chenghu. Method and experiment of mining spatio-temporal association rules based on periodic table [J]. Journal of Geo-information Science, 2011, 13(4): 455-464.
[5] Wang Jiayao, Zhang Xueping, Zhou Haiyan. A genetic K-means algorithm for spatial clustering analysis [J]. Computer Engineering, 2006, 32(3): 188-190.
[6] Zhang Xueping, Du Haohua, Yang Tengfei, et al. A new spatial clustering method for obstacle constraints based on PNPSO and K-medoids [C]. Advances in Swarm Intelligence, Lecture Notes in Computer Science (LNCS), 2010: 476-483.
[7] Shekhar S, Schrater P R, Vatsavai R R, et al. Spatial contextual classification and prediction models for mining geospatial data [J]. IEEE Transactions on Multimedia, 2002, 4(2): 174-187.
[8] Wang Min, Luo, et al. Road network extraction from high-resolution remote sensing images by combining texture model of Gaussian Markov random field and support vector machine [J]. Journal of Remote Sensing, 2005, 9(3): 271-275.
[9] Zhang Zijia, Yue Bangshan, Pan Qi, et al. Fuzzy clustering image segmentation based on ant colony and adaptive filtering [J]. Application of Electronic Technique, 2015, 41(4): 144-147.
[10] Shi Jie. Application of data mining in cloud computing environment [J]. Microcomputer & Its Applications, 2015, 34(5): 13-15.
Source | AET Electronic Technology Application