Algorithm process
Continuous attribute
The K-Means clustering algorithm usually needs to measure the distance between samples, between samples and clusters, and between clusters. Before calculating these distances, each attribute should be standardized with zero-mean normalization so that attributes measured on larger scales do not dominate the result.
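The three kinds of distance mentioned above can be sketched as follows. This is only an illustration: the text does not name a metric, so Euclidean distance is assumed here, a cluster is represented by its centroid, and the helper names and sample points are made up.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two points of equal dimension (assumed metric)."""
    return math.dist(a, b)

def centroid(cluster):
    """Mean of each coordinate over the points in a cluster."""
    n = len(cluster)
    return tuple(sum(col) / n for col in zip(*cluster))

# Made-up, already standardized points for illustration only.
cluster_a = [(0.0, 0.0), (2.0, 0.0)]
cluster_b = [(4.0, 4.0), (6.0, 4.0)]
sample = (1.0, 4.0)

# Distance between two samples.
d_samples = euclidean(cluster_a[0], cluster_a[1])
# Distance between a sample and a cluster (via the cluster centroid).
d_sample_cluster = euclidean(sample, centroid(cluster_a))
# Distance between two clusters (via their centroids).
d_clusters = euclidean(centroid(cluster_a), centroid(cluster_b))
```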
Zero mean normalization
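Also known as standard deviation standardization, this is currently the most commonly used data standardization method. After processing, the data have a mean of 0 and a standard deviation of 1.
Conversion formula: x* = (x - mean) / std, where mean and std are the mean and standard deviation of the original attribute values.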
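A minimal sketch of zero-mean normalization for one attribute, using only the Python standard library. The function name `zscore` and the sample values are hypothetical, and the population standard deviation is assumed; a sample standard deviation would work the same way.

```python
from statistics import mean, pstdev

def zscore(values):
    """Rescale values so that their mean is 0 and standard deviation is 1."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        # A constant attribute carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Made-up recency values (days since last purchase) for illustration.
recency_days = [5, 12, 30, 45, 80]
scaled = zscore(recency_days)
```

Each attribute (for example R, consumption times, and consumption amount) would be standardized separately in this way before distances are computed.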
In practice, to get better results, the K-Means algorithm is usually run several times with different initial cluster centers.
Once all objects have been assigned, the centers of the k clusters are recalculated. For continuous data, the cluster center is the mean of the cluster's members. When some attributes of the samples are categorical variables, the mean may be undefined, and the K-modes method can be used instead.
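The center update can be sketched like this: the mean for continuous attributes and, following the K-modes idea, the most frequent value for categorical ones. This covers only the center update, not a full K-modes implementation; `update_center`, the `categorical` argument, and the sample records are hypothetical.

```python
from statistics import mean, mode

def update_center(members, categorical):
    """Recompute one cluster center.

    members: list of equal-length tuples (the samples assigned to the cluster)
    categorical: set of column indices holding categorical values
    """
    columns = list(zip(*members))
    return tuple(
        mode(col) if i in categorical else mean(col)
        for i, col in enumerate(columns)
    )

# Made-up cluster members: (consumption amount, purchase channel).
members = [(10.0, "web"), (20.0, "store"), (30.0, "web")]
center = update_center(members, categorical={1})
```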
The sum of squared errors (SSE) is used as the objective function to measure the quality of clustering. Of two different clustering results, the one with the smaller SSE is selected.
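Running K-Means several times and keeping the result with the smallest SSE could look like the sketch below. It is a minimal pure-Python version under stated assumptions: squared Euclidean distance, initial centers sampled at random from the data, and made-up function names (`kmeans`, `best_of`) and points.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, rng, max_iter=100):
    """One K-Means run; returns (centers, labels, sse)."""
    centers = rng.sample(points, k)  # random initial cluster centers
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    # SSE: sum of squared distances from each point to its cluster center.
    sse = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return centers, labels, sse

def best_of(points, k, n_runs=10, seed=0):
    """Run K-Means n_runs times and keep the run with the smallest SSE."""
    rng = random.Random(seed)
    runs = (kmeans(points, k, rng) for _ in range(n_runs))
    return min(runs, key=lambda run: run[2])

# Made-up, already standardized points forming two obvious groups.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
centers, labels, sse = best_of(points, k=2)
```

Comparing runs by SSE is only meaningful for the same value of k, since SSE generally decreases as k grows.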
Summary
Characteristics of cluster 1: the R interval is relatively large, mainly concentrated in 30-80 days; the consumption times are concentrated in 0-15; the consumption amount is 0-2000.
Characteristics of cluster 2: the R interval is relatively small, mainly concentrated in 0-30 days; the consumption times are concentrated in 0-10; the consumption amount is 0-1800.
Characteristics of cluster 3: the R interval is relatively small, mainly concentrated in 0-30 days; the consumption times are concentrated in 10-25; the consumption amount is 500-2000.
Comparative analysis
Cluster 3 customers have a short time interval, many consumption times, and a large consumption amount; they are high-consumption, high-value customers.
Cluster 2 customers have a medium time interval, consumption times, and consumption amount; they are general-value customers.
Cluster 1 customers have a long time interval, few consumption times, and a consumption amount that is not particularly high; they are low-value customers.