How to deal with missing data in cluster analysis

(1) Delete objects with missing values. If only a few objects in the data set have missing values, it may be reasonable to discard them. If many objects have missing values, however, this strategy makes reliable analysis difficult, and the discarded objects may carry information that is important for the analysis. Objects should therefore be deleted with care, and only after checking that the deletion does not distort the results.

(2) Estimate the missing values. Sometimes missing values can be estimated reliably from the characteristics of the data. A common choice is to replace a missing value with the average of that attribute over neighboring points. Alternatives are to use the attribute's average over the whole data set, or to fit a curve and take the value from the fit.

(3) Ignore the missing values. Many clustering algorithms can be adapted to process data with missing values directly, for example when computing the similarity between objects. For a pair of objects with missing values, the similarity is computed from only those attributes that are present in both objects, so the result is only approximate. The error has little influence unless the data have few attributes overall or many objects have missing values.
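
The three strategies can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes each object is a list of floats with `math.nan` marking a missing attribute, and the function names (`drop_incomplete`, `impute_column_means`, `partial_distance`) are hypothetical. For strategy (2) it shows only the simplest variant, the data-set average. For strategy (3) it assumes Euclidean distance and rescales the partial sum by the fraction of attributes used, which is one common convention rather than something the text prescribes.

```python
import math


def drop_incomplete(rows):
    """Strategy (1): keep only objects that have no missing (NaN) attribute."""
    return [r for r in rows if not any(math.isnan(v) for v in r)]


def impute_column_means(rows):
    """Strategy (2), simplest variant: replace NaN with the attribute's data-set mean."""
    n_attrs = len(rows[0])
    means = []
    for j in range(n_attrs):
        observed = [r[j] for r in rows if not math.isnan(r[j])]
        # Assumption: an attribute with no observed value at all stays NaN.
        means.append(sum(observed) / len(observed) if observed else math.nan)
    return [[means[j] if math.isnan(v) else v for j, v in enumerate(r)]
            for r in rows]


def partial_distance(a, b):
    """Strategy (3): Euclidean distance over attributes present in both objects.

    Assumption: the squared sum is scaled by (total attributes / shared
    attributes) so that pairs with fewer shared attributes stay comparable.
    """
    shared = [(x, y) for x, y in zip(a, b)
              if not (math.isnan(x) or math.isnan(y))]
    if not shared:
        return math.nan  # nothing to compare
    squared = sum((x - y) ** 2 for x, y in shared)
    return math.sqrt(squared * len(a) / len(shared))


# Toy data: the second object is missing its second attribute.
rows = [[1.0, 2.0], [3.0, math.nan], [5.0, 6.0]]

complete = drop_incomplete(rows)        # [[1.0, 2.0], [5.0, 6.0]]
imputed = impute_column_means(rows)     # [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
d01 = partial_distance(rows[0], rows[1])  # uses only the first attribute
```

In this toy case, deletion loses a third of the objects, imputation keeps all of them but invents a value, and the partial distance keeps the observed values but bases the comparison on a single attribute. This is the trade-off the three strategies describe.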