In text mining, the topic model stands somewhat apart: its ideas differ from the machine learning algorithms we commonly use, so it is worth summarizing text topic modeling algorithms here. This article focuses on the principle of the Latent Semantic Indexing (LSI) algorithm.
1. Characteristics of the text topic model problem
In data analysis, we often use unsupervised clustering algorithms, which group our feature data without supervision. A topic model is also an unsupervised algorithm; its purpose is to obtain the probability distribution of texts over topics. In this respect, topic models look very similar to ordinary clustering algorithms, but there is a real difference between the two.
Clustering algorithms focus on grouping data by the similarity of sample features, for example clustering by the Euclidean or Manhattan distance between data samples. A topic model, as its name implies, models the hidden topics in text. For example, from the two phrases "In the Name of the People" and "Secretary Dakang" we can easily tell that the corresponding texts are strongly related in topic, but clustering on word features alone would have a hard time discovering this, because clustering cannot take the implied topics into account.
So how do we find the hidden topics? This is a big problem. The commonly used methods are generally based on statistical generation: assume a topic is chosen with some probability, and then the words of the current topic are chosen with some probability; finally, these words make up our current text. The statistical distribution of all the words can be obtained from the corpus, and exactly how to "choose with some probability" is the task each specific topic model algorithm solves.
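To make this generative view concrete, here is a minimal sketch of the "pick a topic with some probability, then pick a word from that topic" process. The vocabulary, topic probabilities, and word distributions are all invented for illustration, not taken from this article:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical distributions, invented for illustration: 2 topics over 4 words.
vocabulary = ["election", "official", "goal", "match"]
topic_probs = np.array([0.6, 0.4])            # P(topic)
word_given_topic = np.array([
    [0.45, 0.45, 0.05, 0.05],                 # P(word | topic 0: politics)
    [0.05, 0.05, 0.45, 0.45],                 # P(word | topic 1: sports)
])

def generate_document(n_words):
    """Pick a topic with some probability, then a word from that topic."""
    words = []
    for _ in range(n_words):
        topic = rng.choice(len(topic_probs), p=topic_probs)
        word = rng.choice(len(vocabulary), p=word_given_topic[topic])
        words.append(vocabulary[word])
    return words

print(generate_document(8))
```

A topic model works this process in reverse: given only the generated texts, it tries to recover the topic and word distributions.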
Of course, there are also methods that are not based on statistics, such as the LSI we discuss next.
2. Overview of Latent Semantic Indexing
Latent Semantic Indexing (LSI) is called Latent Semantic Analysis (LSA) in some articles; they are in fact the same thing, and we will refer to it as LSI below. It is a simple and practical topic model. LSI obtains the topics of a text collection through singular value decomposition (SVD). SVD and its applications have come up many times in earlier articles, such as the principle of singular value decomposition and its application in dimensionality reduction, and the application of matrix factorization in collaborative filtering recommendation algorithms. If you are not familiar with SVD, it is suggested to review the article on the principle of singular value decomposition and its application in dimensionality reduction before reading on.
Here we briefly review SVD: an m×n matrix A can be decomposed into the following three matrices:
$$A_{m \times n} = U_{m \times m}\Sigma_{m \times n}V^T_{n \times n}$$
Sometimes, to reduce the dimension of the matrix to k, the SVD can be written approximately as:
$$A_{m \times n} \approx U_{m \times k}\Sigma_{k \times k}V^T_{k \times n}$$
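As a quick illustration of this truncated decomposition, the following numpy sketch (the matrix values are made up for the example) computes the rank-k approximation:

```python
import numpy as np

# A hypothetical m x n matrix (values invented for illustration).
A = np.array([[1.0, 0.0, 2.0, 1.0],
              [0.0, 1.0, 1.0, 0.0],
              [2.0, 1.0, 0.0, 1.0]])

# Full SVD: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values: the rank-k approximation above.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(np.round(A_k, 3))            # low-rank approximation of A
print(np.linalg.norm(A - A_k))     # approximation error
```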
Applied to our topic model, this SVD can be interpreted as follows: the input is m texts, each represented over n words, and $A_{ij}$ is the feature value of the j-th word in the i-th text, most commonly a normalized TF-IDF value after preprocessing. k is the number of topics we assume, which is generally less than the number of texts. After the SVD, $U_{il}$ corresponds to the relevance of the i-th text to the l-th topic, $V_{jm}$ corresponds to the relevance of the j-th word to the m-th word meaning, and $\Sigma_{lm}$ corresponds to the relevance of the l-th topic to the m-th word meaning.
It can also be interpreted the other way around: the input is m words over n texts, and $A_{ij}$ is the feature value of the i-th word in the j-th text, again most commonly a preprocessed, normalized TF-IDF value. k is the number of topics we assume, generally less than the number of texts. After the SVD, $U_{il}$ corresponds to the relevance of the i-th word to the l-th word meaning, $V_{jm}$ corresponds to the relevance of the j-th text to the m-th topic, and $\Sigma_{lm}$ corresponds to the relevance of the l-th word meaning to the m-th topic.
In this way, a single SVD gives us the relevance between texts and topics, between words and word meanings, and between word meanings and topics.
3. A simple LSI example
Here is a simple LSI example. Suppose we have the following word frequency (TF) matrix for 10 words across three texts:
We do not apply preprocessing or TF-IDF here; in practical applications, it is best to use a preprocessed TF-IDF matrix as input.
We assume the number of topics is 2, so the three matrices obtained from the dimension-reduced SVD are:
From the matrix $U_k$ we can read the relevance of words to word meanings, and from $V_k$ the relevance of the three texts to the two topics. Notice that there are negative entries, so the relevance obtained this way is hard to interpret.
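The same computation can be sketched with scikit-learn's TruncatedSVD, which performs exactly this kind of truncated decomposition. The toy TF matrix below is invented for illustration; it is not the 10-word matrix of the example above:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Toy TF matrix, invented for illustration (rows = texts, columns = words).
tf = np.array([[2, 0, 1, 0, 1],
               [0, 2, 0, 1, 0],
               [1, 1, 2, 0, 1]], dtype=float)

svd = TruncatedSVD(n_components=2)        # k = 2 assumed topics
doc_topic = svd.fit_transform(tf)         # text-to-topic coordinates
topic_word = svd.components_              # topic-to-word weights

print(np.round(doc_topic, 3))   # entries can be negative, hence hard to interpret
print(np.round(topic_word, 3))
```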
4. Using LSI for text similarity computation
The text-topic matrix obtained by LSI can be used to compute text similarity, generally via cosine similarity. For the three texts and two topics above, we can compute the cosine similarity between the first and second texts as:
$$sim(d_1, d_2) = \frac{(-0.4945)\times(-0.6458) + (0.6492)\times(-0.7194)}{\sqrt{(-0.4945)^2 + (0.6492)^2}\;\sqrt{(-0.6458)^2 + (-0.7194)^2}}$$
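Plugging in the two topic-space vectors from above, the cosine similarity can be computed directly:

```python
import numpy as np

# Topic-space coordinates of texts 1 and 2, taken from the example above.
d1 = np.array([-0.4945, 0.6492])
d2 = np.array([-0.6458, -0.7194])

sim = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(sim, 4))   # about -0.187: the two texts are not very similar
```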
5. LSI topic model summary
LSI was the earliest topic model, and its algorithmic principle is simple: a single singular value decomposition yields a topic model and solves the word-meaning problem at the same time, which is elegant. However, LSI has several shortcomings, so it is rarely used in practical topic modeling today.
The main problems are:
1) SVD is very time-consuming. In text processing in particular, both the vocabulary and the number of texts are very large, and performing singular value decomposition on such a high-dimensional matrix is very hard.
2) The chosen number of topics has a great influence on the result, and it is difficult to select a suitable k.
3) LSI is not a probabilistic model and lacks a statistical foundation, so its results are hard to interpret intuitively.
For problem 1), non-negative matrix factorization (NMF) can address the speed of the matrix factorization, as sketched below. For problem 2), this is a long-standing issue: the number of topics in most topic models is chosen by experience, though the newer hierarchical Dirichlet process (HDP) can select the number of topics automatically. For problem 3), researchers developed topic models based on probability distributions, such as pLSI (also called pLSA) and Latent Dirichlet Allocation (LDA), to replace topic models based on matrix factorization.
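As a minimal sketch of the NMF alternative, scikit-learn's NMF is a near drop-in replacement for TruncatedSVD; the toy TF matrix is the same invented one used above:

```python
import numpy as np
from sklearn.decomposition import NMF

# The same toy TF matrix as above, invented for illustration.
tf = np.array([[2, 0, 1, 0, 1],
               [0, 2, 0, 1, 0],
               [1, 1, 2, 0, 1]], dtype=float)

# NMF factorizes tf ~ W @ H with non-negative factors, which also avoids
# the hard-to-interpret negative entries that SVD produces.
nmf = NMF(n_components=2, init="nndsvd", random_state=0)
W = nmf.fit_transform(tf)   # text-to-topic weights
H = nmf.components_         # topic-to-word weights

print(np.round(W, 3))
print(np.round(H, 3))
```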
Back to LSI itself: for some small-scale problems, if you want to quickly and roughly find the relationships among topics, LSI is a good choice. Otherwise, if you need a topic model, LDA and HDP are recommended.