The bag-of-words approach represents each document directly by the words that appear in it, but those words are usually driven by underlying factors that differ from document to document, such as the topic being discussed. In this part we discuss these hidden, or latent, variables, then apply the techniques we have learned to estimate their latent Dirichlet distributions, and finally use LDA (Latent Dirichlet Allocation) to model topics.
If we look at the bag-of-words model graphically, it represents the relationship between a set of document nodes and a set of word nodes.
Consider a single article and the text it contains. Assume stemming and text processing have been done correctly and only the important words remain; the article contains three words, "space", "vote", and "exploration", which appear 2, 1, and 3 times respectively. To compute the probability of each word appearing in the article, we divide its count by the total number of words (6), which gives three parameters.
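As a minimal sketch of this calculation (the word counts are the hypothetical ones above):

```python
# Toy word counts for the single article described above (hypothetical values).
counts = {"space": 2, "vote": 1, "exploration": 3}

total = sum(counts.values())  # 6 words in total

# Probability of each word = its count divided by the total word count.
probabilities = {word: count / total for word, count in counts.items()}
print(probabilities)  # {'space': 0.333..., 'vote': 0.166..., 'exploration': 0.5}
```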
For any given document and any observed word, what is the probability that the document generates that word?
Given some documents and some words, as shown below, we label all of these probabilities. Suppose there are 500 documents and 1,000 unique words; how many parameters does this model have?
Just as the single document above needed one parameter per word it contained (three words, three parameters), 1,000 unique words require 1,000 parameters for each document, i.e. 500 × 1,000 parameters for 500 documents.
So in general, the number of parameters equals the number of documents times the number of words.
For our example that is far too many parameters, so we need to reduce the number while keeping most of the information. To do this, we add a small set of topics, or latent variables, to the model. These are what actually drive the generation of words in each document: the words found in each document should be related to the latent variables. In this model, every document is associated with a set of related topics, and each topic in turn is associated with a set of likely words. For example, in the example below there are three topics: science, politics, and sports. We now need to compute two sets of probability distributions.
The first is the probability of each topic given a document, that is, P(topic | document).
The second is the probability of each word given a topic, that is, P(word | topic).
The probability of a word appearing in a document can then be written as a sum over topics of the product of these two probabilities:
P(word | document) = Σ_topic P(topic | document) · P(word | topic)
Question: if there are 500 documents, 10 topics, and 1,000 words, how many parameters does the new model have?
By introducing latent variables, the number of parameters drops from the earlier 500,000 to 15,000 (500 × 10 + 10 × 1,000). This model is called Latent Dirichlet Allocation, abbreviated LDA, and it can be viewed as a matrix factorization; we will see how below.
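A small sketch of how the two smaller sets of parameters combine, using made-up probabilities for one document and one word:

```python
import numpy as np

# Hypothetical probabilities for one document over 3 topics: P(topic | document).
p_topic_given_doc = np.array([0.7, 0.2, 0.1])        # science, politics, sports

# Hypothetical probabilities of the word "space" under each topic: P(word | topic).
p_word_given_topic = np.array([0.4, 0.05, 0.01])

# P(word | document) = sum over topics of P(topic | document) * P(word | topic).
p_word_given_doc = np.dot(p_topic_given_doc, p_word_given_topic)
print(p_word_given_doc)  # 0.291

# Parameter counts for 500 documents, 10 topics, 1,000 words:
print(500 * 1000)            # 500,000 parameters without topics
print(500 * 10 + 10 * 1000)  # 15,000 parameters with topics
```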
Principle: compare the bag-of-words model on the left with the LDA model on the right. In the bag-of-words model on the left, the probability that the second document generates the word "tax" is simply the label on the white arrow in the figure below.
In the LDA model on the right, this probability is obtained by taking the white arrows (document-to-topic probabilities), multiplying each one by the corresponding topic-to-word probability above it, and summing. This is exactly a matrix multiplication.
We can collect the probabilities of the left model into one large matrix, and then write this large bag-of-words matrix as the product of a tall, thin matrix indexed by documents and topics and a wide, flat matrix indexed by topics and words.
In this example, the entry of the bag-of-words matrix for the second document and the word "tax" equals the inner product of the corresponding row of the document-topic matrix and the corresponding column of the topic-word matrix.
As before, if the matrices are large, say 500 documents, 10 topics, and 1,000 words, the bag-of-words matrix has 500,000 entries, while the two matrices in the topic model have 15,000 entries combined.
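To make the sizes concrete, here is a sketch with randomly filled, row-normalized matrices of the shapes mentioned above (the values are placeholders, not a trained model):

```python
import numpy as np

n_docs, n_topics, n_words = 500, 10, 1000

# Document-topic matrix: tall and thin, each row sums to 1.
doc_topic = np.random.rand(n_docs, n_topics)
doc_topic /= doc_topic.sum(axis=1, keepdims=True)

# Topic-word matrix: wide and flat, each row sums to 1.
topic_word = np.random.rand(n_topics, n_words)
topic_word /= topic_word.sum(axis=1, keepdims=True)

# Their product has the same shape as the bag-of-words probability matrix.
bag_of_words = doc_topic @ topic_word

print(bag_of_words.shape)                # (500, 1000) -> 500,000 entries
print(doc_topic.size + topic_word.size)  # 15,000 entries in the two factors
```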
Beyond this simplification, LDA has a bigger advantage: it gives us topics, which let us organize documents by topic. Here we have named the topics science, politics, and sports; in practice the algorithm simply produces unnamed topics, and we look at the words associated with each one to decide what it is about.
The principle of the LDA model is to decompose the bag-of-words matrix on the left into two matrices, one indexed by documents and topics and the other by topics and words.
The meaning of these matrices will be introduced in detail below.
The bag-of-words matrix is computed as follows. Suppose Doc 2 contains the word "space" 3 times and the words "climate" and "rule" once each, with the other words removed as stop words. We write these counts into the corresponding row,
then divide each entry by the sum of its row to get probabilities. For Doc 2 the row sum is 5, so the entries become 0.6, 0.2, and 0.2. This gives the bag-of-words matrix.
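A tiny sketch of this row computation for the Doc 2 counts above:

```python
import numpy as np

# Word counts for Doc 2 after stop-word removal: space 3, climate 1, rule 1.
words = ["space", "climate", "rule"]
counts = np.array([3, 1, 1])

# Divide each count by the row sum (5) to turn counts into probabilities.
row = counts / counts.sum()
print(dict(zip(words, row)))  # {'space': 0.6, 'climate': 0.2, 'rule': 0.2}
```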
Now compute the document-topic matrix. Suppose Doc 3 is mainly about science, with a little sports and politics: say 70% science, 10% politics, and 20% sports. We record these numbers in the corresponding row.
The topic-word matrix is built in a similar way. Take one topic, say politics: suppose we know the probability that this topic generates each word. These probabilities sum to 1, and we put them in the corresponding row.
The product of these two matrices approximates the bag-of-words matrix: it is not exactly equal, but very close. If we can find two matrices whose product is very close to the bag-of-words matrix, we have built a topic model.
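A sketch with made-up numbers showing how the product of a document-topic matrix and a topic-word matrix yields an approximate bag-of-words probability matrix:

```python
import numpy as np

# Hypothetical document-topic matrix (rows: Doc 1-3; columns: science, politics, sports).
doc_topic = np.array([
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.7, 0.1, 0.2],   # Doc 3: mostly science, as in the example above
])

# Hypothetical topic-word matrix (rows: topics; columns: space, climate, vote, rule).
topic_word = np.array([
    [0.4, 0.4, 0.1, 0.1],   # science
    [0.1, 0.1, 0.7, 0.1],   # politics
    [0.1, 0.1, 0.1, 0.7],   # sports
])

# The product approximates the bag-of-words probability matrix (rows still sum to 1).
approx_bag = doc_topic @ topic_word
print(approx_bag.round(2))
```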
How do we find these two matrices?
One option is a traditional matrix factorization algorithm.
Note: the rows of each matrix sum to 1.
However, a collection of documents has a lot of structure in its topics and words, so we use something more elaborate than plain matrix factorization. The key idea is that the entries of the two topic-modeling matrices come from particular probability distributions, and we will use those distributions to find the two matrices.
Imagine a party held in a triangular room. The black dots represent partygoers walking around the room. Suppose there is food in one corner, dessert in another corner, and music in the third.
People are attracted to these corners and start walking toward them; some prefer the music, others the opposite. The dots in the middle represent people who cannot decide between, say, food and dessert, so they stay in between, but in general people tend to end up in the red regions and away from the blue region.
Now suppose we do the opposite: we put a lion in one corner, a fire in another, and radioactive waste in the third. People now behave the other way around: they stay away from the corners and gather toward the centre, so the red region is in the middle and the blue regions are at the corners. We therefore have three situations: attractive things in the corners, nothing in the corners, and dangerous things in the corners. These are examples of Dirichlet distributions.
In these triangles, the probability that a point is in the red area is higher than that in the blue area.
A Dirichlet distribution has one parameter per corner. If the parameters are small, for example 0.7, 0.7, 0.7, we get the situation on the left; if they are all 1, the situation in the middle; and if they are large, for example all 5, the situation on the right.
We can think of these parameters as repulsion factors: large values push points away from their corner, small values pull points toward it.
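A quick sketch of this effect using NumPy's Dirichlet sampler (the parameter values are just the ones mentioned above):

```python
import numpy as np

rng = np.random.default_rng(0)

for alpha in ([0.7, 0.7, 0.7], [1.0, 1.0, 1.0], [5.0, 5.0, 5.0]):
    samples = rng.dirichlet(alpha, size=5)
    print(alpha)
    print(samples.round(2))
    # Small parameters: samples hug the corners (one value near 1).
    # Parameters of 1: samples spread uniformly over the triangle.
    # Large parameters: samples cluster near the centre (values near 1/3).
```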
For example, suppose our topic model has three topics: sports, science, and politics.
Then clearly the distribution on the left is the most suitable for generating our topic model. In the distribution on the left we are likely to pick a point near a corner or an edge, for example near politics, meaning the article is 80% about politics, 10% about sports, and 10% about science.
In the middle distribution every point is equally likely; for example, a document that is 40% about science, 40% about politics, and 20% about sports.
In the distribution on the right we are likely to pick a point near the middle, for example a document where science, sports, and politics all have nearly equal probability.
Every document we pick will be like this: a point in one of these probability distributions. Think about it: given a large collection of articles, is each one more likely to focus on a single topic or on all three at once? Most articles focus on one thing, whether science, sports, or politics; fewer cover two topics, and very few cover all three at the same time. So the most realistic topic distribution is the one on the left.
For an LDA model, then, what we usually do is choose a Dirichlet distribution with very small parameters, such as 0.7, 0.7, 0.7, and sample points from it as documents; each point gives a mixture vector describing the topic distribution of that document.
Below is a three-dimensional view of the Dirichlet distribution. The probability of picking a given point on the triangle is given by the height of the density at that point.
Let's discuss the probability distribution.
Suppose we toss a coin twice and get one heads and one tails. What is the probability p of heads? The coin could be fair, or slightly biased toward heads or tails; two tosses are not enough to tell.
Suppose we believe it is probably fair, but we are not certain. The probability p of heads is most likely 1/2, but other values are possible, so the distribution over p peaks at 1/2, is fairly spread out over the whole interval, and is very low near 0 and 1.
Now suppose we toss the coin 20 times and get 10 heads and 10 tails. We are now more confident that the coin is fair, and the distribution over p looks like this figure, with a taller peak at 0.5.
Next suppose we toss the coin 4 times and get 3 heads and 1 tail. The distribution over p is now centred at 0.75, but after only four tosses our confidence is low, so the curve is broad.
But if we toss it 400 times and get 300 heads, we become confident that p is very close to 0.75: the distribution has a very tall peak at 0.75 and is almost flat everywhere else. This family of distributions is called the beta distribution.
It works for any values of a and b: if heads comes up a times and tails comes up b times, the graph looks like this.
The formula for this distribution is:
f(x) = Γ(a + b) / (Γ(a) · Γ(b)) · x^(a−1) · (1 − x)^(b−1)
If you haven't seen the gamma function, you can think of it as a continuous version of the factorial function.
In this formula, if a is an integer, Γ(a) is simply the factorial of a−1. For example, Γ(5) = 4! = 24 and Γ(6) = 5! = 120. The interesting thing about the gamma function is that it is also defined for non-integer arguments; for example, Γ(5.5) is a value between 24 and 120.
What if the counts themselves are not integers? For example, 0.1 heads and 0.4 tails. That is impossible for real coin flips, but it is perfectly fine for the beta distribution.
We just plug the values into the formula with the gamma function. For values less than 1, such as 0.1 and 0.4, the graph shows that p is much more likely to be near 0 or 1 than near the middle.
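A sketch with SciPy's beta distribution for the counts used above (the x grid is arbitrary):

```python
import numpy as np
from scipy.stats import beta

x = np.linspace(0.01, 0.99, 5)   # a few sample points in (0, 1)

# 10 heads and 10 tails: peak at 0.5.
print(beta.pdf(x, a=10, b=10).round(2))

# 3 heads and 1 tail: mean at 0.75, but the curve is broad (low confidence).
print(beta.pdf(x, a=3, b=1).round(2))

# 300 heads and 100 tails: sharp peak near 0.75.
print(beta.pdf(x, a=300, b=100).round(2))

# Fractional "counts" 0.1 and 0.4: density is highest near 0 and 1.
print(beta.pdf(x, a=0.1, b=0.4).round(2))
```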
The multinomial distribution is the generalization of the binomial distribution to more than two outcomes.
For example, suppose we have news articles and three topics: science, politics, and sports. Suppose topics have been assigned to the articles seen so far: 3 science articles, 6 politics articles, and 2 sports articles. What is the probability that a new article is about science, politics, or sports?
Intuitively, politics is the most likely, then science, then sports; we cannot be certain, but that is the most plausible guess.
These probabilities live on a triangle. A point in a corner means that topic has probability 1 and the others have probability 0; a point in the middle means the three topics have roughly equal probability.
So the probability density over these three probabilities might look like this graph,
with more mass toward politics, because an article is more likely to be about politics than about sports.
This probability density is given by the following generalization of the beta distribution, which is called the Dirichlet distribution:
f(p1, …, pK) = Γ(a1 + … + aK) / (Γ(a1) · … · Γ(aK)) · p1^(a1−1) · … · pK^(aK−1)
Again, the parameters need not be integers; for example, 0.7, 0.7, 0.7 gives a perfectly valid Dirichlet distribution.
When the parameters are small, the density becomes very high near the corners of the triangle, even though this is not easy to see in the plot. This means that a point drawn at random from this distribution is very likely to fall near the corner for science, politics, or sports, or at least near an edge, and is very unlikely to land in the middle.
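A sketch comparing the density near a corner with the density at the centre, for the small parameters mentioned above:

```python
from scipy.stats import dirichlet

alpha = [0.7, 0.7, 0.7]

near_corner = [0.98, 0.01, 0.01]     # almost entirely one topic
centre = [1 / 3, 1 / 3, 1 / 3]       # an even mix of the three topics

print(dirichlet.pdf(near_corner, alpha))  # noticeably higher density
print(dirichlet.pdf(centre, alpha))       # lower density
```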
These are examples of Dirichlet distributions with different values.
Note:
If the parameters are large, the density is high in the middle.
If the parameters are small, the density is high in the corners.
If the parameters differ, the high-density region shifts toward the corners with the smaller parameters and away from those with the larger parameters.
Here is the three-dimensional view. If we want to build a good topic model, we choose very small parameters, as in the picture on the left, and then sample topic mixtures from that distribution.
Now let's build the LDA model and walk through how it works.
These are our documents.
Suppose the three documents on the right are the real ones, and we generate some fake documents, such as the three on the left, using the topic model. We then compare the generated documents with the real ones to judge how well the model reproduces real documents. As with most machine learning algorithms, we learn from the errors and improve the topic model.
In the original paper the topic model is drawn like this, which looks more complicated.
Next, let's analyze the above picture in detail.
First we choose topics for the documents, starting from a Dirichlet distribution whose parameters are small, so that the density sits near the corners and edges; a point chosen from this distribution is most likely to lie near a corner, or at least near an edge.
Suppose we pick a point near the politics corner that gives the following values: science 0.1, politics 0.8, sports 0.1. These values describe the mix of topics in this document and define a multinomial distribution: the probability that a chosen topic is science is 10%, politics 80%, and sports 10%. From it we pick some topics, for example politics and science, as shown on the far right.
Do this for multiple documents, each of which is a point in this Dirichlet distribution.
Suppose Document 1 gets this multinomial distribution here, and Document 2 gets this one here.
We do the same for all the other documents.
Now stack all these vectors together to get the first matrix: the one indexed by documents and topics.
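A sketch of this step: one Dirichlet draw per document, stacked into the document-topic matrix (all values are randomly generated placeholders):

```python
import numpy as np

rng = np.random.default_rng(42)

n_docs = 5
topics = ["science", "politics", "sports"]

# One Dirichlet draw per document gives that document's topic mixture.
doc_topic = rng.dirichlet([0.7, 0.7, 0.7], size=n_docs)

print(doc_topic.round(2))   # each row sums to 1; most rows lean toward one topic
```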
Now we do the same thing for topics and words.
For ease of visualization, assume there are only four words: "space", "climate", "vote", and "rule". We now have a different Dirichlet distribution, similar to the previous one, except that it lives not on a triangle but on a three-dimensional simplex (a tetrahedron). As before, red indicates high probability and blue low probability. With more words the distribution would still live on a well-defined simplex, just one of higher dimension; we chose four words precisely so it can be drawn in three dimensions.
From this distribution we pick a random point, which is likely to lie near a corner or an edge. Suppose that point gives the following multinomial distribution: space 0.4, climate 0.4, vote 0.1, and rule 0.1. This multinomial describes the relationship between a topic and the words; sampling a word from it gives "space" with probability 40%, "climate" with probability 40%, and "vote" and "rule" with probability 10% each. The resulting point might be the one on the far right.
We do this for each topic. Suppose topic 1 lands close to "space" and "climate", topic 2 close to "vote", and topic 3 close to "rule".
Note that we do not know what these topics are; they are just topics 1, 2, and 3. By inspection we can infer that topic 1, close to "space" and "climate", must be science; topic 2, close to "vote", is probably politics; and topic 3, close to "rule", is probably sports. This labeling is the very last step of the model.
In the last step, we combine these three topic vectors to obtain the other matrix in the LDA model, the one indexed by topics and words.
Now let's put everything together and see how the two matrices in the LDA model are obtained from their respective Dirichlet distributions.
The usual approach is the one we have just seen: the entries of the first matrix come from points drawn from the distribution α, and the entries of the second matrix come from points drawn from the distribution β. The goal is to find the best positions for these points, which give the best factorization of the matrix, and those best positions are what give us the topics we want.
We now generate some documents to compare with the original documents. We start with the Dirichlet distribution over topics and select a point for every document. Taking the first of these points, it provides values for the three topics and defines a multinomial distribution.
This is the topic mixture for Document 1. Now we generate some words for Document 1. How many words? That detail is beyond the scope of this course; in short, the number of words is drawn from a Poisson variable, which can simply be treated as another parameter of the model.
We choose topics according to the probabilities given by this distribution: science with probability 0.7, politics with 0.2, and sports with 0.1. Now, how do we connect words to these topics?
We use the Dirichlet distribution over words: each topic corresponds to a point at some position in that distribution, and that point gives the distribution of words produced by the topic.
For example, topic 1 (science) generates the word "space" with probability 0.4, "climate" with probability 0.4, "vote" with probability 0.1, and "rule" with probability 0.1. These distributions form the rows of the topic-word matrix.
For each selected topic we use its multinomial distribution to pick a word. For example, if the first topic is science, we look at the row for science and choose a word from it, say "space", which becomes the first word of Document 1.
We repeat this for each chosen topic, generating the words of the first generated document, which we call "Fake Doc 1".
Then we choose another point from the distribution, obtain another multinomial distribution, generate new topics and new words from it to produce "Fake Doc 2", and continue this way to generate many documents.
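A compact sketch of this generative process; the Dirichlet parameters and the Poisson mean for the document length are assumed values, not ones from the text:

```python
import numpy as np

rng = np.random.default_rng(1)

topics = ["science", "politics", "sports"]
vocab = ["space", "climate", "vote", "rule"]

alpha = [0.7, 0.7, 0.7]          # Dirichlet parameters over topics (assumed)
beta = [0.7, 0.7, 0.7, 0.7]      # Dirichlet parameters over words (assumed)
mean_length = 8                  # Poisson mean for document length (assumed)

# One word distribution per topic, drawn from the word Dirichlet.
topic_word = rng.dirichlet(beta, size=len(topics))

def generate_fake_doc():
    # Topic mixture for this document, drawn from the topic Dirichlet.
    theta = rng.dirichlet(alpha)
    # Number of words in the document, drawn from a Poisson variable.
    n_words = max(1, rng.poisson(mean_length))
    words = []
    for _ in range(n_words):
        z = rng.choice(len(topics), p=theta)       # pick a topic for this word
        w = rng.choice(vocab, p=topic_word[z])     # pick a word from that topic
        words.append(w)
    return words

print(generate_fake_doc())   # e.g. ['space', 'climate', 'space', ...]
```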
Now we compare these documents with the original documents and use maximum likelihood to find the arrangement of points that gives the highest probability of producing the real articles.
To summarize the process: we have two Dirichlet distributions. From one we draw the topic mixtures of the documents, and from the other the word distributions of the topics; combining the two, we create fake articles and compare them with the real ones. Of course, the probability of reproducing a real article exactly is tiny, but some arrangement of points on the two distributions maximizes that probability. Our goal is to find that arrangement and thereby obtain the topics, just as we train many other machine learning algorithms.
This process produces an error, which tells us how far we are from generating the real articles. The error is propagated back to the distributions to obtain a gradient, which tells us in which direction to move the points to reduce the error. We repeat this, moving the points a little at a time, getting a slightly better model with each iteration until we reach a good arrangement of points.
A good arrangement of points gives us the topics: the points in the topic Dirichlet distribution tell us which topics each article is related to, and the points in the word Dirichlet distribution tell us which words are related to each topic.
We can go further and propagate the error all the way back, so that we improve not only the arrangement of the points but also the two Dirichlet distributions themselves. This is the principle of Latent Dirichlet Allocation.
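In practice, libraries implement this fitting procedure for us. A minimal sketch with scikit-learn (the document strings are made up; the two learned matrices correspond to the document-topic and topic-word matrices discussed above):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# A handful of made-up documents.
docs = [
    "space exploration and climate research",
    "the vote on the new climate rule",
    "the team broke the scoring rule record",
]

# Bag-of-words count matrix (documents x words).
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

# Fit LDA with 3 topics.
lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topic = lda.fit_transform(counts)   # document-topic matrix (rows sum to 1)
topic_word = lda.components_            # unnormalized topic-word matrix

print(doc_topic.round(2))
print(vectorizer.get_feature_names_out())
```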
If you want to relate this to the diagram in the paper, here it is.
The symbols are the quantities we have just described:
α is the Dirichlet distribution over topics,
β is the Dirichlet distribution over words,
the multinomial distributions are obtained from these two distributions,
z is the topic,
and w is the word; combining them gives the documents.