Notes on Quantitative Ecology: Cluster Analysis
This week we begin the fourth chapter of Notes on Quantitative Ecology: cluster analysis. Cluster analysis, also known as grouping analysis, is a statistical method for classifying samples or indicators (variables), and an important data mining technique. In ecological research, the purpose of clustering is to identify discontinuous subsets of objects in the environment. In practice, cluster analysis partitions the set of objects under study into groups.

Note that most clustering methods are computed from an association matrix, so choosing an appropriate association coefficient is very important.

As shown in the figure, we need to distinguish the different types of clustering methods and the conditions under which each applies.

Single linkage clustering is also called nearest neighbor clustering; it agglomerates objects on the basis of their shortest pairwise distance. The list of the first links that join each object or cluster to another is called the chain of primary connections, and it forms the minimum spanning tree.
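As a rough illustration of the chain of primary connections, the Python sketch below builds the minimum spanning tree of a small dissimilarity matrix with Kruskal's algorithm. The matrix D and the function name minimum_spanning_tree are invented for this example and are not data from these notes.

```python
def minimum_spanning_tree(dist):
    """Return the MST edges (i, j, d) of a symmetric dissimilarity matrix."""
    n = len(dist)
    # All pairwise links, shortest first (Kruskal's algorithm).
    edges = sorted((dist[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    group = list(range(n))  # group[i] = id of the cluster that holds object i
    tree = []
    for d, i, j in edges:
        if group[i] != group[j]:  # the link joins two different clusters
            tree.append((i, j, d))
            old, new = group[j], group[i]
            group = [new if g == old else g for g in group]
    return tree

# Hypothetical dissimilarities among four sites.
D = [
    [0.0, 0.2, 0.5, 0.9],
    [0.2, 0.0, 0.4, 0.8],
    [0.5, 0.4, 0.0, 0.3],
    [0.9, 0.8, 0.3, 0.0],
]
primary_links = minimum_spanning_tree(D)
print(primary_links)
```

Each accepted link corresponds to one single linkage fusion, and its length is the level at which that fusion occurs.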

In complete linkage clustering (also called furthest neighbor clustering), an object or cluster is allowed to join another group only at the distance of their farthest pair of members.

Under single linkage, an object joins a group easily, because one link to any member of the group is enough to cause fusion. For this reason single linkage clustering is also called the closest friend method. The groups it produces are not clear-cut, but gradients are easy to identify. In contrast, complete linkage clustering produces groups that are clearly separated from one another. It tends to yield many small, distinct groups, which makes it better suited to finding and identifying discontinuities in the data.
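The contrast between the two methods can be sketched in a few lines of Python. This is a minimal illustration, not an optimized implementation: the function agglomerate and the five site positions along a gradient are assumptions made for the example. Passing min gives single linkage and passing max gives complete linkage.

```python
from itertools import combinations

def agglomerate(dist, linkage):
    """Hierarchical agglomeration of a dissimilarity matrix.

    `linkage` reduces all member-to-member distances between two clusters
    to one value: min = single linkage, max = complete linkage.
    Returns the fusions as (cluster_a, cluster_b, level) in fusion order.
    """
    clusters = [(i,) for i in range(len(dist))]
    fusions = []
    while len(clusters) > 1:
        def between(pair):
            a, b = pair
            return linkage(dist[i][j] for i in a for j in b)
        a, b = min(combinations(clusters, 2), key=between)
        fusions.append((a, b, between((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [tuple(sorted(a + b))]
    return fusions

# Hypothetical sites along a gradient; dissimilarity = difference in position.
positions = [0, 10, 21, 33, 46]
D = [[abs(p - q) for q in positions] for p in positions]

single_levels = [level for _, _, level in agglomerate(D, min)]
complete_levels = [level for _, _, level in agglomerate(D, max)]
print(single_levels)    # sites are chained one by one at similar levels
print(complete_levels)  # groups stay apart until much larger distances
```

With these invented data, single linkage adds one site at a time to the same growing group (chaining along the gradient), while complete linkage keeps compact groups separate for longer.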

Average agglomerative clustering is a family of methods based on the average dissimilarity between objects or on cluster centroids. There are four such methods (UPGMA, WPGMA, UPGMC, and WPGMC). They differ in how the position of a group is computed (arithmetic average or centroid) and in whether the number of objects in each group is used as a weight when computing the fusion.

The best known of these is UPGMA. An object joins a group at the average of the distances between that object and each member of the group.
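The UPGMA rule can be written as a short Python sketch. The function name upgma_distance and the four site positions are assumptions made for illustration only.

```python
def upgma_distance(dist, group_a, group_b):
    """Unweighted arithmetic mean of all member-to-member distances."""
    pairs = [dist[i][j] for i in group_a for j in group_b]
    return sum(pairs) / len(pairs)

# Hypothetical sites along a gradient; dissimilarity = difference in position.
positions = [0, 10, 21, 33]
D = [[abs(p - q) for q in positions] for p in positions]

# Should site 2 join the existing group {0, 1} or the single site 3?
to_left = upgma_distance(D, [2], [0, 1])   # mean of 21 and 11
to_right = upgma_distance(D, [2], [3])     # the single distance 12
joins = "right" if to_right < to_left else "left"
print(to_left, to_right, joins)
```

Under single linkage, site 2 would have joined the left group (its nearest member is at distance 11); averaging over all members reverses that choice in this example.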

Note that UPGMC and WPGMC can sometimes produce reversals in the dendrogram (a fusion at a lower level than the one before it), which makes the classification difficult to interpret.
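A reversal is easy to reproduce with three points that form a nearly equilateral triangle. The Python sketch below uses made-up coordinates and assumes that the fusion level is the plain Euclidean distance between centroids; actual implementations may use squared distances instead.

```python
from math import dist

# Hypothetical coordinates of three objects.
A, B, C = (0.0, 0.0), (2.0, 0.0), (1.0, 1.8)

# Step 1: the closest pair fuses first. Here it is A and B, at distance 2.0.
first_level = min(dist(A, B), dist(A, C), dist(B, C))

# Step 2: the new group is represented by its centroid.
centroid_ab = tuple((a + b) / 2 for a, b in zip(A, B))

# Step 3: C joins at its distance to that centroid.
second_level = dist(centroid_ab, C)

print(first_level, second_level)
print("reversal:", second_level < first_level)
```

The second fusion happens at a lower level than the first, so the branch for C would be drawn below the node that joins A and B.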

Ward's minimum variance clustering is based on the least-squares criterion of the linear model: groups are formed so that the within-group sum of squares (that is, the squared error of analysis of variance) is minimized.
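The criterion can be sketched in Python as follows: at each step, fuse the pair of groups whose union adds the least to the total within-group sum of squares. The one-dimensional values and the names within_ss and ward_increase are assumptions made for this example.

```python
from itertools import combinations

def within_ss(values):
    """Sum of squared deviations of the values from their group mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def ward_increase(a, b):
    """Increase in total within-group sum of squares if groups a and b fuse."""
    return within_ss(a + b) - within_ss(a) - within_ss(b)

# Hypothetical groups described by a single variable.
groups = {"g1": [1.0, 2.0], "g2": [4.0], "g3": [8.0, 9.0]}

costs = {
    (a, b): ward_increase(groups[a], groups[b])
    for a, b in combinations(groups, 2)
}
best_pair = min(costs, key=costs.get)
print(costs)
print(best_pair)
```

With these values the cheapest fusion is g1 with g2, so that pair would be merged next.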

Remember that cluster analysis is an exploratory procedure, not a statistical test. The results depend on the choice of clustering method and on the association coefficient used.

The cophenetic distance between two objects in a dendrogram is the distance at which they first become members of the same group. For any two objects in a completed hierarchical clustering, start from one object and climb up the tree to the first node from which you can descend to the second object. The level of that node is the cophenetic distance between the two objects.
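The rule above can be turned into a small Python sketch. It assumes the dendrogram is stored as a list of fusions (members of branch a, members of branch b, fusion level); this representation, the function name cophenetic_matrix, and the example tree are all hypothetical.

```python
def cophenetic_matrix(fusions, n):
    """Cophenetic distances from a list of (members_a, members_b, level)."""
    coph = [[0] * n for _ in range(n)]
    for a, b, level in fusions:
        # Objects on opposite branches of this node first meet at `level`.
        for i in a:
            for j in b:
                coph[i][j] = coph[j][i] = level
    return coph

# Hypothetical dendrogram for five objects, in fusion order.
fusions = [
    ((0,), (1,), 10),
    ((2,), (3,), 12),
    ((4,), (2, 3), 25),
    ((0, 1), (2, 3, 4), 46),
]
coph = cophenetic_matrix(fusions, 5)
for row in coph:
    print(row)
```

Objects 2 and 4, for example, first share a group at the third fusion, so their cophenetic distance is the level of that node.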

To describe how well the cophenetic matrix produced by each clustering method agrees with the original distance matrix, the correlation between the two matrices can be computed, and a Shepard diagram of the original distances against the cophenetic distances can be drawn.
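As a minimal sketch, the correlation between original and cophenetic distances can be computed as a Pearson correlation over the pairwise values. The two lists below are invented for illustration (ten pairs among five objects, listed in the same order), and the helper name pearson is an assumption.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical pairwise distances, same pair order in both lists.
original = [10, 21, 33, 46, 11, 23, 36, 12, 25, 13]
cophenetic = [10, 46, 46, 46, 46, 46, 46, 12, 25, 25]

r = pearson(original, cophenetic)
# Points of the Shepard diagram: x = original, y = cophenetic distance.
shepard_points = list(zip(original, cophenetic))
print(round(r, 3))
```

The closer the correlation is to 1, and the closer the Shepard points lie to the diagonal, the better the dendrogram preserves the original distances.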

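Another criterion for comparing clustering results is the Gower distance: the sum of squared differences between the original distances and the cophenetic distances. The clustering method with the smallest value fits the original distance matrix best.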
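A short Python sketch of this sum of squared differences follows. The distance lists are hypothetical (ten pairs among five objects, in the same order), and the labels "single" and "complete" simply name two invented dendrograms being compared.

```python
def gower_distance(original, cophenetic):
    """Sum of squared differences between original and cophenetic distances."""
    return sum((o - c) ** 2 for o, c in zip(original, cophenetic))

# Hypothetical pairwise distances, same pair order in all lists.
original = [10, 21, 33, 46, 11, 23, 36, 12, 25, 13]
coph_single = [10, 11, 12, 13, 11, 12, 13, 12, 13, 13]
coph_complete = [10, 46, 46, 46, 46, 46, 46, 12, 25, 25]

scores = {
    "single": gower_distance(original, coph_single),
    "complete": gower_distance(original, coph_complete),
}
best = min(scores, key=scores.get)
print(scores, best)
```

The method with the smaller value distorts the original distances less, by this criterion, for the data at hand.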

To interpret and compare clustering results, we usually need to find interpretable clusters, which means deciding at which level the dendrogram should be cut.

The fusion level values of a dendrogram are the dissimilarity values at which two of its branches fuse.

Use the cutree() function to cut the dendrogram into a chosen number of groups, and use a contingency table to compare the resulting classifications.
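cutree() is an R function. To show the idea behind it, here is a rough Python sketch that cuts a dendrogram into k groups and cross-tabulates two classifications. The function cut_tree, the fusion-list representation, and both example trees are assumptions made for illustration, not the R implementation.

```python
from collections import Counter

def cut_tree(fusions, n, k):
    """Replay fusions (in fusion order) until k groups remain.

    `fusions` is a list of (members_a, members_b, level).
    Returns one group label (1..k) per object.
    """
    membership = list(range(n))
    for a, b, _level in fusions[: n - k]:
        target = membership[a[0]]
        for i in b:
            membership[i] = target
    ids = {}  # relabel groups 1..k in order of first appearance
    return [ids.setdefault(g, len(ids) + 1) for g in membership]

# Two hypothetical dendrograms of the same five objects.
single = [((0,), (1,), 10), ((2,), (0, 1), 11),
          ((3,), (0, 1, 2), 12), ((4,), (0, 1, 2, 3), 13)]
complete = [((0,), (1,), 10), ((2,), (3,), 12),
            ((4,), (2, 3), 25), ((0, 1), (2, 3, 4), 46)]

groups_single = cut_tree(single, 5, 2)
groups_complete = cut_tree(complete, 5, 2)

# Contingency table: counts of objects for each pair of group labels.
table = Counter(zip(groups_single, groups_complete))
print(groups_single, groups_complete)
print(dict(table))
```

If the two classifications agreed perfectly, each row and each column of the table would contain a single non-zero cell; here two objects are placed differently by the two trees.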

The silhouette width measures the degree to which an object belongs to its cluster. It compares the average distance between the object and the other objects of its own group with the average distance between the object and all objects of the nearest other cluster.
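The comparison is usually expressed as s = (b - a) / max(a, b), where a is the average distance to the other members of the object's own group and b is the average distance to the members of the nearest other group. The Python sketch below applies this formula to made-up data; the function name silhouette_widths and the convention of assigning 0 to singleton groups are assumptions.

```python
def silhouette_widths(dist, groups):
    """Silhouette width of each object, given a group label per object."""
    n = len(dist)
    widths = []
    for i in range(n):
        own = [j for j in range(n) if groups[j] == groups[i] and j != i]
        if not own:  # singleton group: width set to 0 by convention
            widths.append(0.0)
            continue
        a = sum(dist[i][j] for j in own) / len(own)
        # Average distance to each other group; keep the nearest one.
        b = min(
            sum(dist[i][j] for j in range(n) if groups[j] == g) / groups.count(g)
            for g in set(groups) if g != groups[i]
        )
        widths.append((b - a) / max(a, b))
    return widths

# Hypothetical sites along a gradient and a two-group classification.
positions = [0, 10, 21, 33, 46]
D = [[abs(p - q) for q in positions] for p in positions]
groups = [1, 1, 2, 2, 2]

widths = silhouette_widths(D, groups)
print([round(w, 3) for w in widths])
```

Values near 1 indicate objects that sit well inside their group. A negative value, as for the third site in this example, suggests the object is on average closer to another group than to its own.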
