A Survey of Clustering Algorithms in Data Mining

Author | Su Mark

Source | Zhihu

This article covers the principles, workflow, practical tips, evaluation methods and use cases of clustering algorithms. For the details of any specific algorithm, please refer to the relevant literature. The main application of clustering discussed here is customer segmentation.

1. Clustering and classification

Classification is "supervised learning": the categories are known in advance.

Clustering is "unsupervised learning": the categories are not known in advance.

For example, take apples, bananas, kiwis, mobile phones and telephones.

Grouping them by their characteristics, clustering puts apples, bananas and kiwis into one class (fruit), and mobile phones and telephones into another (digital products).

Classification means that when we are shown a "strawberry", we assign it to the already known category "fruit".

Put simply: classification learns how to judge data from a training set and then applies that judgment to unknown data; clustering groups similar things into one category without any training data to learn from.

Academic explanation: Classification analyzes a group of objects in the database, finds their common attributes, and divides them into different categories according to a classification model. A classification model is first built from the training data; the test data in the database are then classified according to these class descriptions, or more suitable descriptions are generated.

Clustering divides the data in the database into a series of meaningful subsets, namely classes. Within a class, the distances between individuals are small, while the distances between individuals in different classes are large. Cluster analysis is often called "unsupervised learning".
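The "small distance within a class, large distance between classes" idea can be checked numerically. The following is a minimal sketch, assuming Euclidean distance and a hypothetical four-point toy data set; the function name mean_pairwise_distances is made up for illustration.

```python
import numpy as np

def mean_pairwise_distances(points, labels):
    """Return (mean within-cluster distance, mean between-cluster distance)."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    # Euclidean distance between every pair of points.
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # same[i, j] is True when points i and j carry the same label.
    same = labels[:, None] == labels[None, :]
    # Count each unordered pair once and skip the diagonal.
    upper = np.triu(np.ones_like(same, dtype=bool), k=1)
    within = dists[same & upper].mean()
    between = dists[~same & upper].mean()
    return within, between

# Hypothetical toy data: two well-separated groups in 2-D.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
labels = [0, 0, 1, 1]
within, between = mean_pairwise_distances(points, labels)
print(within, between)
```

For a reasonable clustering, the first number should be clearly smaller than the second.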

2. Common applications of clustering

Practical applications include:

Marketing: customer segmentation

Insurance: identifying groups of auto insurance customers with high claims

Urban planning: identifying properties of the same type

For example, anyone doing buyer or seller analysis will have come across customer segmentation: customers are divided by agreed criteria into high-value customers, average-value customers and potential customers, and each group is offered a different marketing plan.

Likewise, insurance companies care most about customers with high claims, because these customers directly affect their profits.

In real estate, properties can be clustered by location, price and surrounding facilities into hot areas and cold areas.

3. k-means

(1) Assume the data form K clusters.

(2) The goal is to find compact clusters.

A. Randomly initialize the cluster centers

B. Assign each data point to the nearest cluster center

C. Recompute each cluster center from the points assigned to it

Repeat B and C until convergence.
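The steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not a reference implementation: it assumes Euclidean distance, initializes by picking k distinct data points at random, and treats "convergence" as the centers no longer moving. The function name kmeans and the parameters max_iter and seed are chosen here for the example.

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means sketch; returns (centers, labels)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # A. Randomly initialize: pick k distinct data points as starting centers.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # B. Assign each point to its nearest center (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # C. Recompute each center as the mean of its assigned points;
        #    an empty cluster keeps its previous center.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Converged once the centers stop moving.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Hypothetical toy data: two groups of three points each.
points = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 9], [9, 10]], dtype=float)
centers, labels = kmeans(points, k=2)
print(centers)
print(labels)
```

The result depends on the random starting centers: an unlucky start can leave the algorithm in a poor partition, so in practice it is common to run it with several seeds and keep the most compact result.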

Advantages: guaranteed to converge, though only to a local optimum.

Disadvantages: it handles non-convex clusters poorly.

How to choose K?

K <= sample size

K depends on the distribution of the data and the desired resolution

Information criteria such as AIC and DIC can guide the choice
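As one possible illustration of choosing K with AIC, the sketch below fits a Gaussian mixture model for each candidate K and compares the AIC values. Note the substitution: it uses scikit-learn's GaussianMixture rather than k-means itself, because that class exposes an aic method directly. The toy data, the candidate range 1 to 5 and the helper name aic_by_k are assumptions made for this example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def aic_by_k(points, k_values, seed=0):
    """Fit one Gaussian mixture per candidate K and return {K: AIC}."""
    points = np.asarray(points, dtype=float)
    scores = {}
    for k in k_values:
        model = GaussianMixture(n_components=k, random_state=seed).fit(points)
        scores[k] = model.aic(points)  # lower is better
    return scores

# Hypothetical toy data: two well-separated 2-D blobs of 100 points each.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(100, 2)),
    rng.normal(loc=10.0, scale=1.0, size=(100, 2)),
])
scores = aic_by_k(points, k_values=range(1, 6))
best_k = min(scores, key=scores.get)
print(scores)
print(best_k)
```

The K with the lowest AIC is a candidate, not a verdict; the desired resolution of the segmentation still has to be weighed against it.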