Generally speaking, when we are given an article, it is unlabeled. Our goal is to obtain a classification for that article through machine-learning algorithms. That is our starting point.
As mentioned above, when the data set is large enough and has many features, the distance metrics used by traditional clustering algorithms lose much of their meaning. In other words, high-dimensional data suffers from the curse of dimensionality: points in a high-dimensional space lie far apart, near its "corners", and the distances between them carry little information. When the distance metric fails, the clustering results are poor.
Therefore, traditional unsupervised algorithms do not perform well on text classification. In the feature processing for text classification we use the bag-of-words method and TF-IDF; both generate a word vector for the current article based on our corpus, where the value of each element in the vector is determined by how frequently the corresponding word appears in the article.
The resulting word vectors are sparse, because no single article can contain all the phrases in the corpus. What do we do in this case? We introduce the topic model to address the failure of the distance metric. Relatively speaking, its results may be somewhat better.
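To make the sparsity concrete, here is a minimal sketch using scikit-learn's TfidfVectorizer; the three-article corpus is invented purely for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the article library (invented for illustration).
corpus = [
    "apple released a new iphone today",
    "the basketball season starts next week",
    "microsoft announced a new windows update",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)   # one sparse TF-IDF row vector per article

print(X.shape)                                                    # (3, vocabulary_size)
print(f"non-zero entries: {X.nnz} of {X.shape[0] * X.shape[1]}")  # most entries are zero
```

With a realistic corpus the vocabulary runs into the tens of thousands, so each article's vector is overwhelmingly zeros.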
However, topic models remain quite controversial in industry at present, and many people consider them not very effective. It can also be put this way: the model matters less than the corpus. That is, a high-quality corpus helps improve classification no matter which model you use.
From the LSA model, a probability-based model called PLSA was derived; it is essentially a generative model.
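As a rough sketch of what LSA itself does (assuming scikit-learn; the tiny corpus is again invented), one can factor the TF-IDF matrix with a truncated SVD so each document becomes a dense vector in a low-dimensional latent "topic" space:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

corpus = [
    "apple released a new iphone",
    "the nba season begins tonight",
    "microsoft ships a windows update",
]

X = TfidfVectorizer().fit_transform(corpus)      # sparse TF-IDF matrix
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = lsa.fit_transform(X)               # (3, 2) dense document embeddings
print(doc_vectors)
```

Documents that share latent topics end up close together in this space even when they share few exact words.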
When we discussed statistics earlier, we noted that there are two schools of statistical thought:
1. the frequentist school of traditional statistics;
2. the Bayesian school.
In fact, the frequentist school of traditional statistics, in our view, lacks prior conditions.
Bayesians, by contrast, hold that everything in the world is conditioned on what came before it.
So Bayesian reasoning must take many prior conditions into account. That is, P(A|B) = P(A) × P(B|A) / P(B): to get the probability of event A after observing evidence B, Bayes' rule combines the prior P(A) with the likelihood P(B|A), normalized by P(B).
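As a quick numeric illustration of the rule (all numbers invented), the posterior P(A|B) can be computed from the prior and the likelihood:

```python
# Toy example of Bayes' rule: P(A|B) = P(A) * P(B|A) / P(B).
p_a = 0.3             # prior P(A)
p_b_given_a = 0.8     # likelihood P(B|A)
p_b_given_not_a = 0.2

# Total probability: P(B) = P(A)P(B|A) + P(not A)P(B|not A)
p_b = p_a * p_b_given_a + (1 - p_a) * p_b_given_not_a

p_a_given_b = p_a * p_b_given_a / p_b
print(p_a_given_b)    # posterior P(A|B) ~= 0.632
```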
The LDA model is a topic model obtained by adding prior distributions (Dirichlet priors) to the PLSA model.
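A minimal sketch of fitting LDA, assuming scikit-learn's LatentDirichletAllocation and a toy corpus (both are illustrative assumptions):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    "apple iphone jobs apple stock",
    "nba basketball season star nba",
    "apple iphone launch stock price",
]

counts = CountVectorizer().fit_transform(corpus)  # LDA works on raw word counts
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)            # per-document topic proportions
print(doc_topics)                                 # each row sums to 1
```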
The traditional way to judge the similarity of two documents is to count the words that appear in both, for example with TF-IDF. This approach ignores the semantic associations behind words: two documents may share few or no words yet still be similar.
For example, consider the following two sentences:
"Jobs left us."
"Will the price of Apple drop?
In fact, anyone with common sense knows that when a company's founder dies, its stock price is likely to fall. So both sentences are essentially about Apple, yet they share no words at all. If we analyze them with the traditional bag-of-words method, the similarity between the two comes out as 0. It is in cases like this that we must consider the topic model.
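We can check this with a quick bag-of-words computation (a sketch assuming scikit-learn; tokenization details may vary):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["Jobs left us", "Will the price of Apple drop"]
X = CountVectorizer().fit_transform(docs)   # bag-of-words count vectors
print(cosine_similarity(X[0], X[1]))        # [[0.]] -- the sentences share no words
```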
A topic model is a statistical model for discovering the abstract topics in a collection of documents. Intuitively, if an article has a central idea, certain specific words will appear more frequently. For example, if an article is about Apple, words such as "Jobs" and "iPhone" will appear more often; if it is about Microsoft, words such as "Windows" and "Microsoft" will appear more often. In reality, however, an article usually contains several topics in different proportions; if 10% of an article relates to Apple and 90% to Microsoft, then Microsoft-related keywords should appear about nine times as often as Apple-related ones.
A topic model analyzes each document automatically: it counts the words in the document and, from these statistics, determines which topics the document contains and in what proportions.
A topic model is a way of modeling the hidden topics in text, where each topic is in fact a probability distribution over the words in the vocabulary.
A topic model is a generative model: every word in an article is produced by a process of "choose a topic with a certain probability, then choose a word from that topic with a certain probability."
Let us simulate the writing process of a freelance writer:
1. The author conceives many candidate topics for an article → chooses "Apple" as the topic with probability 72% → opens the article with the word "basketball" with probability 0.23% → opens the second paragraph with the word "iPhone" with probability 87%.
2. The author conceives many candidate topics → chooses "basketball" as the topic with probability 5% → opens the article with the name of a certain star with probability 90% → opens the second paragraph with the word "iPhone" with probability 0.035%.
Based on our common knowledge, the probability that the word "iPhone" appears in an article whose main topic is Apple is much higher than in an article whose main topic is basketball; conversely, in an article about basketball the word "iPhone" may not appear at all. A small sampling sketch of this process follows.
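Below is a minimal sketch of that two-step "pick a topic, then pick a word" process; the topic mixture and word distributions are invented numbers, not values estimated from any corpus:

```python
import random

# Invented topic-word distributions: each topic is a probability
# distribution over the words in the vocabulary.
topics = {
    "apple":      {"iphone": 0.5, "jobs": 0.3, "stock": 0.2},
    "basketball": {"nba": 0.6, "star": 0.3, "iphone": 0.1},
}
# Invented topic proportions for one document.
topic_mixture = {"apple": 0.7, "basketball": 0.3}

def sample_word() -> str:
    # Step 1: choose a topic with a certain probability.
    topic = random.choices(list(topic_mixture), weights=topic_mixture.values())[0]
    # Step 2: choose a word from that topic's word distribution.
    words = topics[topic]
    return random.choices(list(words), weights=words.values())[0]

print([sample_word() for _ in range(10)])  # e.g. ['iphone', 'jobs', 'nba', ...]
```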
When we want to generate a topic for the current article, we look to the article library. For example, during the NBA offseason we assume NBA news appears less often in the press-release library; when Apple holds a new product launch, we assume Apple-related articles become more common in the news library. In other words, the prior probability of each topic shifts with context.
In essence, the two writing processes of the freelance writer that we just built form a Bayesian network.
Combining these basic concepts, let us look at the underlying formula. First of all, how do we find the joint probability P(phrase, topic, article) = P(w, t, d)? Following the generative process above, we first pick an article d, then pick a topic t for that article, then pick a word w from that topic, so that P(w, t, d) = P(d) × P(t|d) × P(w|t). Summing out the hidden topic gives the word-document joint probability P(w, d) = P(d) Σ_t P(t|d) P(w|t).
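As a toy numeric check of this factorization (all probability tables invented), the full joint table P(w, d) can be assembled with a matrix product:

```python
import numpy as np

# Invented probability tables for 2 documents, 2 topics, 3 words.
p_d = np.array([0.5, 0.5])                 # P(d)
p_t_given_d = np.array([[0.9, 0.1],        # P(t|d): rows sum to 1
                        [0.2, 0.8]])
p_w_given_t = np.array([[0.7, 0.2, 0.1],   # P(w|t): rows sum to 1
                        [0.1, 0.3, 0.6]])

# P(w, d) = P(d) * sum_t P(t|d) P(w|t), computed for all (d, w) at once.
p_wd = p_d[:, None] * (p_t_given_d @ p_w_given_t)
print(p_wd)
print(p_wd.sum())   # 1.0: a valid joint distribution over (word, document)
```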
Topic models overcome the shortcomings of traditional document-similarity computations in information retrieval and can uncover the semantic topics behind the words in massive data sets. They play an important role in natural language processing and text retrieval.
How is a topic generated? How do we analyze the topics of an article? These are the problems the topic model aims to solve.
Topic model: singular value decomposition (matrix factorization) and the LSA model