The whole process of SPSS cluster analysis is explained in detail.

Case data source:

The data set covers 20 kinds of beer (12 oz servings) and records their composition and price. The variables are beer name, calories, sodium content, alcohol content and price.

Question 1: Which variables are selected for clustering? -Using "R-type clustering"

1. We now have four variables with which to classify the beers. Is it necessary to include all four as clustering variables? Calories, sodium content and alcohol content all have to be measured by laboratory technicians, at considerable cost. Including all of them would be wasteful if some carry redundant information, so it is worth reducing the number of variables. Here SPSS R-type clustering (clustering of variables rather than cases) is used for that purpose. The "similarity matrix" output helps us follow the reduction.

2. The four variables are measured in different units. Here we measure similarity between variables with the Pearson correlation coefficient and use furthest neighbor (complete linkage) as the clustering method. Because the Pearson correlation does not depend on the scale of the variables, they need not be standardized, and the entries of the resulting similarity matrix are the correlation coefficients. If the correlation coefficient of two variables is close to 1 or -1, the two variables can stand in for each other.

3. Output only the "tree diagram" (dendrogram). Personally, I find the icicle plot complicated and less clear than the tree diagram. The Proximity Matrix table shows that calories and alcohol content have a correlation coefficient of 0.903, the largest in the table, so there is no need to keep both as clustering variables; doing so would only increase cost. Which of the two to keep as the representative indicator can be decided from professional knowledge or from how difficult each is to measure. (Unlike factor analysis, this approach drops one of the variables entirely to reduce the dimension.) Alcohol content is selected here. The variables used for clustering are therefore alcohol content, sodium content and price.
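For readers working outside SPSS, the variable-clustering step can be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the data are synthetic placeholders generated on the spot (not the article's beer data), the column names are hypothetical, and turning the correlation into a distance as 1 - |r| is one common choice rather than necessarily what SPSS does internally.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

# Synthetic placeholder data, NOT the article's data set.
# Calories are built to correlate strongly with alcohol on purpose.
rng = np.random.default_rng(0)
n = 20
alcohol = rng.normal(4.5, 0.8, n)
beer = pd.DataFrame({
    "calories": 30 * alcohol + rng.normal(0, 5, n),
    "sodium": rng.normal(15, 5, n),
    "alcohol": alcohol,
    "price": rng.normal(0.5, 0.15, n),
})

# Similarity matrix between variables (columns): Pearson correlations.
corr = beer.corr(method="pearson")

# Assumption: use 1 - |r| as the dissimilarity, so that variables with a
# correlation near 1 or -1 count as "close".
dist = 1.0 - corr.abs().to_numpy()
np.fill_diagonal(dist, 0.0)

# Complete linkage corresponds to "furthest neighbor".
Z = linkage(squareform(dist, checks=False), method="complete")

# Dendrogram structure without plotting (no_plot=True avoids matplotlib).
tree = dendrogram(Z, no_plot=True, labels=list(corr.columns))

# Find the most strongly correlated pair of variables.
abs_corr = corr.abs().to_numpy().copy()
np.fill_diagonal(abs_corr, 0.0)
i, j = np.unravel_index(abs_corr.argmax(), abs_corr.shape)
pair = (corr.index[i], corr.columns[j])

print(corr.round(3))
print("Most correlated pair:", pair)
```

On real data, the pair reported at the end is the candidate for dropping one variable; which one to drop remains a judgment based on domain knowledge or measurement cost.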

Question 2: How many categories can the 20 kinds of beer be divided into? -Using "Q-type clustering"

1. The 20 beers are now clustered. At the outset we do not know how many categories there should be, so we provisionally consider a range of 3 to 5. Q-type clustering (clustering of cases) requires variables on a common scale, so the data must be standardized; the distance measure used this time is squared Euclidean distance.

2. The categories are examined mainly through the tree diagram and the icicle plot. Whether to settle on 4 categories or 3 is a judgment call that depends on professional knowledge and the original purpose of the analysis. I settled on four categories. Select Save, and the cluster membership is written to the data area automatically.

Question 3: Do the variables used for clustering contribute to the clustering process and results, and are they useful? -Using "one-way ANOVA"

1. Besides determining the number of categories, a key question in cluster analysis is whether each clustering variable contributes to the classification. A variable that has no influence on the classification should be removed.

2. This is generally judged by one-way analysis of variance. Note that the factor is now the saved cluster membership (four categories), and the three clustering variables are treated as dependent variables. The ANOVA results show that the Sig. values of all three clustering variables are highly significant, so using these three variables for clustering is reasonable.

Question 4: Interpretation of clustering results?

1. The last and most difficult step of cluster analysis is to define and explain the resulting categories and describe their characteristics. This requires professional knowledge, combined with the purpose of the analysis.

2. We can use the Compare Means procedure of SPSS or the pivot table function of Excel to summarize each indicator by category. Here, the Report output is used to describe the clustering results. By comparing the indicators across categories, each category is given an initial definition; the judgment rests mainly on professional knowledge.

That completes the analysis. The process above involves Q-type and R-type clustering in SPSS hierarchical clustering, one-way ANOVA, the Means procedure and so on. It is a good example of combining several analysis methods.
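Questions 2 to 4 can also be sketched outside SPSS. The Python example below is a rough illustration only: the data are synthetic placeholders (not the article's beer data), the column names are hypothetical, and average linkage is an assumption, since the article does not name the linkage method used for the Q-type step.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
from scipy.stats import f_oneway

# Synthetic placeholder data, NOT the article's data set.
rng = np.random.default_rng(1)
beer = pd.DataFrame({
    "alcohol": rng.normal(4.5, 0.8, 20),
    "sodium": rng.normal(15, 5, 20),
    "price": rng.normal(0.5, 0.15, 20),
})

# Question 2: standardize to z-scores, then cluster the cases using
# squared Euclidean distance.
z = (beer - beer.mean()) / beer.std(ddof=1)
condensed = pdist(z.to_numpy(), metric="sqeuclidean")
Z = linkage(condensed, method="average")  # assumption: average linkage

# Cut the tree into at most 4 clusters and save the membership,
# similar in spirit to the Save option in SPSS.
labels = fcluster(Z, t=4, criterion="maxclust")
clustered = beer.assign(cluster=labels)

# Question 3: one-way ANOVA per clustering variable, with the cluster
# membership as the factor.
anova = {}
for col in beer.columns:
    groups = [g[col].to_numpy() for _, g in clustered.groupby("cluster")]
    result = f_oneway(*groups)
    anova[col] = (result.statistic, result.pvalue)

# Question 4: describe each cluster by its mean on every variable.
means = clustered.groupby("cluster").mean()

for col, (f_stat, p_value) in anova.items():
    print(f"{col}: F = {f_stat:.2f}, p = {p_value:.4f}")
print(means.round(2))
```

The table of group means plays the role of the Report output: comparing the rows gives a first description of each category, which still has to be interpreted with domain knowledge.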