NLP basics and summary
NLTK-style toolkits: a popular natural language processing library that ships with its own corpora and provides classification, tokenization (word segmentation) and other functions; it is used mainly for English and other non-Chinese text, playing a role similar to Chinese word-segmentation libraries such as jieba.

A model that assigns probabilities to word sequences is called a language model.

Generally speaking, a language model is a model that can compute the probability that any word sequence forms a sentence; equivalently, a language model can predict what the next word in a word sequence will be.

**N-gram language model**

The n-gram model is a typical probability-based statistical language model. A statistical language model treats a language (a sequence of words) as a random event and assigns it a probability describing how likely it is to belong to a given language. Given a vocabulary V and a word sequence S = ⟨w1, …, wT⟩ over V, a statistical language model assigns the sequence a probability P(S) that measures the confidence that S conforms to the grammatical and semantic rules of natural language. Simply put, a statistical language model is a model for computing the probability of a sentence.
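Concretely, the probability a statistical language model assigns to a sequence is usually factored with the chain rule, and the n-gram model discussed below truncates each conditioning history to the previous n−1 words. A standard way to write this (notation added here for clarity):

```latex
P(S) = P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})
     \approx \prod_{t=1}^{T} P(w_t \mid w_{t-n+1}, \dots, w_{t-1})
```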

By conditioning on only a short history, the n-gram model alleviates the problem that most word sequences never appear in the training set, i.e., the data sparsity problem.

Problems of the n-gram model

The following shows how the n-gram model works and where its problems lie.

The n-gram model is based on the assumption that the occurrence of the current word depends only on the previous n−1 words and on nothing else, so the probability of a whole sentence is the product of the conditional probabilities of its words. These probabilities can be estimated by directly counting how often n words occur together in a corpus. Bigram (n = 2) and trigram (n = 3) models are the most common; the bigram model corresponds to the (first-order) Markov assumption.

The most commonly used n-gram models are the bigram and trigram models, expressed as follows:

Bigram: P(T) = p(w1 | begin) p(w2 | w1) p(w3 | w2) ⋯ p(wn | wn−1)

Trigram: P(T) = p(w1 | begin1, begin2) p(w2 | w1, begin1) p(w3 | w2, w1) ⋯ p(wn | wn−1, wn−2)

Note how these probabilities are estimated: P(w1 | begin) = (number of sentences that start with w1) / (total number of sentences); P(w2 | w1) = (number of times w1 is immediately followed by w2) / (number of times w1 occurs); and so on.

Note also that when computing the bigram probabilities, the begin term is generally handled by adding an explicit start symbol (e.g. <s>) to the front of each sentence; a worked example follows below.

Problems with the n-gram model:

A few examples illustrate this. Suppose we have the following (Chinese) corpus:

老鼠真讨厌。老鼠真丑。你爱老婆,我讨厌老鼠。 ("Rats are really annoying. Rats are really ugly. You love your wife, and I hate rats." The example works at the character level in Chinese: 老鼠 "rat" and 老婆 "wife" share the first character 老.)

Suppose we want to predict the character that follows "我爱老" ("I love lao…"). We make the prediction with the bigram model and with the trigram model.

1) With the bigram model, we need to compute P(w | 老). By counting, 老鼠 ("rat") appears 3 times and 老婆 ("wife") appears once, so by maximum likelihood estimation P(鼠 | 老) = 0.75 and P(婆 | 老) = 0.25.

2) With the trigram model, we need to compute P(w | 爱老). By counting, 爱老婆 appears once, so by maximum likelihood estimation P(婆 | 爱老) = 1, and the whole sentence predicted by the trigram model is 我爱老婆 ("I love my wife"). Clearly this prediction is more reasonable.
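The counts above can be reproduced in a few lines of Python. This is a minimal character-level sketch; the corpus is the reconstructed Chinese example above, and the variable names are illustrative:

```python
from collections import Counter

# Toy corpus from the example (character-level, one sentence per string)
sentences = ["老鼠真讨厌", "老鼠真丑", "你爱老婆", "我讨厌老鼠"]

bigrams = Counter(s[i:i + 2] for s in sentences for i in range(len(s) - 1))
trigrams = Counter(s[i:i + 3] for s in sentences for i in range(len(s) - 2))

# Bigram model: distribution of the character following 老
lao_total = sum(c for bg, c in bigrams.items() if bg[0] == "老")
print(bigrams["老鼠"] / lao_total)      # P(鼠 | 老) = 3/4 = 0.75
print(bigrams["老婆"] / lao_total)      # P(婆 | 老) = 1/4 = 0.25

# Trigram model: distribution of the character following 爱老
ailao_total = sum(c for tg, c in trigrams.items() if tg[:2] == "爱老")
print(trigrams["爱老婆"] / ailao_total)  # P(婆 | 爱老) = 1/1 = 1.0
```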

Problem 1: As n increases, more context (prior information) is available and the next word can be predicted more accurately. But this also brings a problem: when n is too large, many n-grams never appear in the training data, so many predicted probabilities are 0. This is the sparsity problem, and it is why in practice only bigram or trigram models are usually used. The problem can be further alleviated by smoothing.

Problem 2: Because of the sparsity problem above, n must stay small, so the n-gram model cannot capture long-range dependencies in the context.

Problem 3: The n-gram model is purely frequency-based, so its ability to generalize is limited.

N-gram summary: A statistical language model computes the probability of a sentence, where the probability of the whole sentence is the product of the conditional probabilities of its words; the larger the probability, the more plausible the sentence. The n-gram model is a typical statistical language model: it assumes that the current word depends only on the previous n−1 words and on nothing else. This raises several problems. As n grows, more context is available and the current word can be predicted more accurately, but if n is too large sparsity sets in and many probabilities become 0; to avoid this, bigram or trigram models are usually used, which in turn means the n-gram model cannot capture long-range dependencies. In addition, n-grams are based purely on frequency counts and do not generalize well.

Neural network language model

The Neural Network Language Model (NNLM) was proposed by Bengio et al. in 2003. Its key idea is the word vector: instead of the discrete, high-dimensional representation used by n-gram models, words are represented by continuous, distributed real-valued vectors of a fixed dimension. This alleviates the curse of dimensionality, and the word vectors also provide a measure of similarity between words.

The task of the language model it builds is to predict the next word from the preceding words within a fixed-size window; from another point of view, it is therefore an n-gram model encoded by a neural network.

It is a very simple neural network consisting of only four layers: an input layer, an embedding layer, a hidden layer and an output layer.

The input is the sequence of indices of the context words. For example, if the index of "this" in the dictionary (of size |V|) is 10, the index of "is" is 23 and the index of "test" is 65, these indices are fed in to predict the next word within the window. The embedding layer is a matrix of size |V| × K (K is chosen by the user; the matrix is randomly initialized, is updated during backpropagation, and after training it is exactly the table of word vectors). Rows 10, 23 and 65 are looked up, giving a 3 × K block that is concatenated into a single vector. The hidden layer takes this concatenated embedding output as input and applies a tanh activation; its output goes to the output layer with a softmax. Training maximizes the softmax probability assigned to the word to be predicted.
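The forward pass just described can be sketched in a few lines of PyTorch. This is a minimal illustrative implementation, not the original model; the layer sizes and indices below are assumptions:

```python
import torch
import torch.nn as nn

class NNLM(nn.Module):
    """Minimal feed-forward neural language model in the spirit of Bengio et al. (2003)."""
    def __init__(self, vocab_size, embed_dim, hidden_dim, window):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)      # the |V| x K word-vector matrix
        self.hidden = nn.Linear(window * embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context):                  # context: (batch, window) word indices
        e = self.embed(context)                  # (batch, window, K)
        e = e.view(context.size(0), -1)          # concatenate the window's vectors
        h = torch.tanh(self.hidden(e))           # hidden layer with tanh
        return self.out(h)                       # logits; softmax is applied inside the loss

# Example: predict the next word from a window of 3 words (indices are illustrative)
model = NNLM(vocab_size=10000, embed_dim=100, hidden_dim=128, window=3)
context = torch.tensor([[10, 23, 65]])           # e.g. indices of "this", "is", "test"
target = torch.tensor([7])                       # index of the word to predict
loss = nn.CrossEntropyLoss()(model(context), target)
loss.backward()                                  # the embedding matrix is updated by backprop
```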

Disadvantages: because the language model is trained as a feed-forward network, it has many parameters and the softmax over the vocabulary is expensive to compute. Moreover, NNLM is essentially an n-gram model encoded by a neural network, so it still cannot model long-range dependencies.

RNNLM

RNNLM trains the language model with an RNN (or one of its variants); the task is to predict the next word from the preceding words. Compared with NNLM, its advantage lies in the RNN itself, which is naturally suited to sequential data: it removes the fixed context-window limitation and summarizes the entire history in the hidden state, so it can capture longer dependencies and achieves better experimental results. RNNLM also has fewer hyperparameters and is more general. However, because of the vanishing-gradient problem of RNNs, it still has difficulty capturing dependencies over very long distances.
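For contrast with the fixed-window NNLM above, here is a minimal LSTM language-model sketch (again illustrative; sizes and names are assumptions), in which the hidden state carries the whole history rather than a fixed window:

```python
import torch
import torch.nn as nn

class RNNLM(nn.Module):
    """Minimal recurrent language model: the LSTM state summarizes all previous words."""
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):                   # tokens: (batch, seq_len)
        h, _ = self.lstm(self.embed(tokens))     # hidden state at step t has seen words 1..t
        return self.out(h)                       # next-word logits at every position

model = RNNLM(vocab_size=10000, embed_dim=100, hidden_dim=256)
tokens = torch.randint(0, 10000, (2, 12))        # dummy batch of word-index sequences
logits = model(tokens)                           # shape: (2, 12, 10000)
```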

Word2vec includes CBOW and skip-gram: CBOW predicts the center word from the context within the window, while skip-gram predicts the context within the window from the center word.

GloVe is a statistics-based model that trains word vectors from global co-occurrence counts.

ELMo uses a multi-layer bidirectional LSTM (typically two layers) to train a language model whose task is to predict the current word from its context: the forward LSTM captures the preceding context and the backward LSTM captures the following context. This bidirectionality is weak (the two directions are trained separately), so truly joint context information is not obtained.

GPT trains its language model with a Transformer. The language model is unidirectional, so it predicts the next word from the preceding words.

BERT trains a masked language model (MLM) with Transformers; it is a genuinely bidirectional language model that predicts the current (masked) word from both its left and right context.

A detailed introduction to the models above is given in the NLP pre-training articles.

Evaluation metrics for language models
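The metric most commonly used for this purpose is perplexity: the lower the perplexity of a model on held-out text, the better the model. A standard definition (added here for reference) is:

```latex
\mathrm{PPL}(S) = P(w_1, \dots, w_T)^{-1/T}
               = \exp\!\Big(-\tfrac{1}{T}\sum_{t=1}^{T} \log P(w_t \mid w_1, \dots, w_{t-1})\Big)
```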

Latent Semantic Analysis (LSA) Model

Latent Semantic Analysis (LSA) represents the corpus as a word-document matrix and uses singular value decomposition (SVD) to find a low-rank approximation of this matrix (keeping the truncated factors U_k and V_k).

Probabilistic Latent Semantic Analysis (PLSA) Model

The Probabilistic Latent Semantic Analysis (PLSA) model was proposed to overcome some shortcomings of the Latent Semantic Analysis (LSA) model. A fundamental problem of LSA is that although each column of U_k and V_k can be regarded as a topic, the values in those columns can be almost arbitrary real numbers, so we cannot further interpret what they mean, and the model cannot be understood from a probabilistic point of view.

The PLSA model gives LSA a probabilistic interpretation via a generative model. It assumes that every document contains a set of possible latent topics, and that every word in a document is not generated out of thin air but according to certain probabilities under the guidance of these latent topics.

In the PLSA model, a topic is a probability distribution over words; each topic corresponds to a different distribution over words, and each document can in turn be viewed as a probability distribution over topics. Every document is generated through this two-level probability structure, which is the core idea of the PLSA generative model.

PLSA models the joint distribution of d and w with the following formula:
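In standard notation (consistent with the symmetric and asymmetric formulations described in the next paragraph), the factorization is:

```latex
P(w, d) = \sum_{z} P(z)\, P(d \mid z)\, P(w \mid z) = P(d) \sum_{z} P(z \mid d)\, P(w \mid z)
```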

The number of topics z in this model is a hyperparameter that must be specified in advance. Note that the formula gives two expressions for P(w, d). In the first, d and w are both generated by conditional probabilities given z, in the same way, so it is called "symmetric". In the second, d is given first, then a topic z is generated according to P(z | d), and then a word w is generated according to P(w | z); because documents and words are not generated in the same way here, it is called "asymmetric".

In the plate-notation representation of the asymmetric PLSA model, d denotes the document, z the topic generated for the document, and w the word generated from the topic. In this model d and w are observed variables, while z is a latent variable (representing the hidden topic).

It is easy to see that for a new document we cannot know its corresponding P(d), so although PLSA is a generative model over the given documents, it cannot generate new, unseen documents. Another problem is that the number of parameters P(z | d) grows linearly with the number of documents, so the model tends to overfit no matter how much training data there is. These two points are the main defects that limit wider use of the PLSA model.

Latent Dirichlet Allocation (LDA) Model

To solve the overfitting problem of PLSA, Blei et al. proposed the Latent Dirichlet Allocation (LDA) model, which has become the most widely used model in topic-model research. LDA places PLSA in a Bayesian framework, i.e., LDA is a Bayesian version of PLSA (being Bayesian, it incorporates prior knowledge by adding two Dirichlet prior parameters).

As seen in the previous section, in the PLSA model we know nothing about P(d) for an unseen new document d, which does not match human experience: it fails to use information that could have been used, namely the prior information that LDA introduces.

Specifically, in LDA each document is regarded as being more or less related to each of a finite set of given topics, and this relatedness is described by a probability distribution over the topics; this part is consistent with PLSA.

However, in the LDA model the per-document distribution over topics is itself given a prior, usually a sparse Dirichlet distribution. This sparse Dirichlet prior can be seen as encoding a piece of human prior knowledge: the topics of an article are usually concentrated on a few themes, and an article rarely covers many topics with no clear focus.

In addition, LDA places a sparse Dirichlet prior on each topic's distribution over words, with a similar intuition: within a single topic, a small number of words highly related to that topic usually appear with high frequency, while other words appear much less often. These two priors allow LDA to describe the relationships among documents, topics and words better than PLSA.
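As a concrete illustration of these two priors, the following is a minimal sketch of fitting LDA with gensim (assuming gensim is available; the toy documents and parameter choices are illustrative, not from the original):

```python
from gensim import corpora, models

# Tiny illustrative corpus: each document is a list of tokens
texts = [["topic", "model", "lda", "dirichlet"],
         ["word", "segmentation", "chinese", "dictionary"],
         ["lda", "topic", "prior", "sparse"]]

dictionary = corpora.Dictionary(texts)                    # word <-> id mapping
bow_corpus = [dictionary.doc2bow(doc) for doc in texts]   # bag-of-words counts

# alpha and eta are the two Dirichlet priors: over document-topic and topic-word distributions
lda = models.LdaModel(bow_corpus, num_topics=2, id2word=dictionary,
                      alpha="auto", eta="auto", passes=10)

print(lda.print_topics())                      # top words per topic
print(lda.get_document_topics(bow_corpus[0]))  # topic distribution of one document
```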

In fact, PLSA can be seen as LDA with the prior replaced by a uniform distribution, with parameters obtained by maximum a posteriori estimation (which, under a uniform prior, is equivalent to maximum likelihood estimation). This also shows how important a reasonable prior is for modeling.

Word segmentation is the process of recombining a continuous character sequence into a word sequence according to certain norms.

The existing word segmentation algorithms can be divided into three categories: word segmentation based on string matching, word segmentation based on understanding and word segmentation based on statistics.

According to whether it is combined with the part-of-speech tagging process, it can be divided into simple word segmentation method and comprehensive method combining word segmentation and tagging.

Chinese word segmentation is mainly divided into the following two categories according to the realization principle and characteristics:

(1) Dictionary-based word segmentation algorithms

Also known as string-matching segmentation algorithms. The algorithm matches the string to be segmented against entries in a sufficiently large dictionary according to some strategy; if an entry is found, the match succeeds and a word is recognized. Common dictionary-based algorithms include the forward maximum matching method, the reverse maximum matching method and the bidirectional maximum matching method.

Dictionary-based segmentation is the most widely used and the fastest approach. Researchers have long been optimizing string-matching methods, e.g., by setting a maximum word length, by how strings are stored and searched, and by how the lexicon is organized, such as using a TRIE index tree or hash indexing.
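A minimal sketch of the forward maximum matching strategy mentioned above; the toy dictionary and sentence are illustrative, and a real implementation would typically store the dictionary in a TRIE for speed:

```python
def forward_max_match(sentence, dictionary, max_len=5):
    """Greedy forward maximum matching: at each position, take the longest dictionary word."""
    words = []
    i = 0
    while i < len(sentence):
        matched = sentence[i]                      # fall back to a single character
        for n in range(min(max_len, len(sentence) - i), 1, -1):
            candidate = sentence[i:i + n]
            if candidate in dictionary:
                matched = candidate
                break
        words.append(matched)
        i += len(matched)
    return words

# Illustrative dictionary and sentence (not from the original)
dictionary = {"自然", "自然语言", "语言", "处理", "自然语言处理"}
print(forward_max_match("我爱自然语言处理", dictionary, max_len=6))
# -> ['我', '爱', '自然语言处理']
```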

(2) Machine learning algorithm based on statistics

Commonly used algorithms of this kind include HMM, CRF (conditional random fields), SVM and deep learning; for example, the Stanford and HanLP segmenters are based on CRF. Taking CRF as an example, the basic idea is to label each Chinese character, considering both word frequency and context; this gives good learning ability and therefore good recognition of ambiguous words and out-of-vocabulary words.

Common segmenters combine a machine learning algorithm with a dictionary, which improves segmentation accuracy on the one hand and domain adaptability on the other.
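For example, jieba, a widely used open-source Chinese segmenter, combines a prefix dictionary with a statistical model (an HMM for words not in the dictionary). A minimal usage sketch, assuming jieba is installed:

```python
import jieba  # combines a prefix dictionary with an HMM for out-of-vocabulary words

print(list(jieba.cut("我爱自然语言处理")))             # default (precise) mode
print(list(jieba.cut_for_search("我爱自然语言处理")))  # finer-grained mode, e.g. for search indexing
```

The finer-grained search mode relates to the indexing/query granularity trade-off discussed later in this section.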

With the rise of deep learning, neural-network-based segmenters have also appeared. For example, some have implemented a segmenter with a bidirectional LSTM + CRF, which is essentially sequence labeling, so the same model can also be used for named entity recognition and other tasks; its reported segmentation accuracy can reach 97.5%. The idea of this framework is similar to the paper "Neural Architectures for Named Entity Recognition", and Chinese word segmentation can be implemented with it as follows:

First, character embeddings are trained on the corpus; the resulting features are fed into a bidirectional LSTM, and a conditional random field layer on top produces the label sequence.

At present, there are three main difficulties in Chinese word segmentation:

1. Segmentation standards: for example, in the Harbin Institute of Technology standard the surname and given name are separated, whereas HanLP keeps them together. Different segmentation standards therefore need to be defined for different needs.

2. Ambiguity: the same character string to be segmented has multiple segmentation results.

Ambiguity can be divided into three types: combination ambiguity, intersection ambiguity and true ambiguity.

Usually, search engines use different segmentation granularities for indexing and querying: fine-grained segmentation at indexing time to guarantee recall, and coarse-grained segmentation at query time to guarantee precision.

3. New words: also known as words not included in the dictionary. The solution of this problem depends on people's further understanding of word segmentation technology and Chinese language structure.

A typical text classification process can be divided into three steps:

1. Text representation

The purpose of this step is to express the text in a form the classifier can handle. The most commonly used approach is the vector space model, which represents the text collection as a word-document matrix in which each element is the weight of a word in the corresponding document. Choosing which words to use to represent the text is called feature selection; common feature selection methods include document frequency, information gain, mutual information and expected cross entropy. To reduce the computation in classification, dimensionality reduction such as LSI is usually applied as well.

2. Classifier construction.

The purpose of this step is to choose or design a method to construct the classifier. Different methods have their own advantages and disadvantages and applicable conditions, and the classifier should be selected according to the characteristics of the problem. We will talk about the common methods in detail later. After selecting the method, a classifier is established for each category on the training set, and then the classifier is applied to the test set to get the classification result.

3. Effect evaluation (classifier evaluation)

After classification, the results need to be evaluated. The evaluation is performed on the classification results for the test set (not the training set), and the commonly used criteria are inherited from the IR field, including recall, precision and the F1 value.
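The three steps above can be strung together with scikit-learn in a few lines; this is a minimal sketch with made-up toy documents and labels (assuming scikit-learn is installed):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report

# Illustrative toy data; in practice these come from a labeled training/test split
train_docs = ["the team won the match", "stocks fell sharply today",
              "the striker scored a goal", "the market rallied after the report"]
train_labels = ["sports", "finance", "sports", "finance"]
test_docs = ["the goalkeeper saved a penalty", "investors sold their shares"]
test_labels = ["sports", "finance"]

# Step 1: text representation (TF-IDF weights); Step 2: classifier (naive Bayes)
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(train_docs, train_labels)

# Step 3: evaluation on the test set (precision, recall, F1)
print(classification_report(test_labels, clf.predict(test_docs)))
```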

1. Rocchio method

Each category is represented by a centroid; the distance between the document to be classified and each category's centroid is computed and used to decide which category the document belongs to. The Rocchio method is simple to implement and efficient; its drawback is that it is affected by the distribution of the text collection, e.g., the computed centroid may fall outside its own category.

2. Naive Bayes (Naïve Bayes) method

This applies a probabilistic model to automatic document classification and is a simple and effective method. Using Bayes' formula, the posterior probability that a document belongs to a category is estimated from the category's prior probability and the conditional probabilities of the document's words, and the document is assigned accordingly.

3. K-nearest neighbor (KNN) method

K nearest neighbors (documents) closest to the document to be classified are found from the training set, and the category of the document to be classified is determined according to the categories of these K nearest neighbors. The advantage of KNN method is that it does not need feature selection and training, and it is easy to handle a large number of categories. One of its disadvantages is its high spatial complexity. The classifier obtained by KNN method is a nonlinear classifier.

4. Support Vector Machine (SVM) method

For a given category, find a separating surface such that the positive and negative examples of the category fall on its two sides, and among all such surfaces choose the one with the largest margin, i.e., whose distance to the nearest positive and negative examples is equal and maximal. The advantage of SVM is that it needs relatively little training data and computation; its disadvantage is that it depends heavily on the positive and negative examples near the separating surface, which makes it sensitive to their positions.

Text clustering process can be divided into three steps:

1. Text representation

Represent documents in a form that clustering algorithms can handle. Please refer to the text classification section for the technology used.

2. Selection or design of clustering algorithm.

The choice of algorithm is often accompanied by the choice of similarity calculation method. In text mining, the most commonly used similarity calculation method is cosine similarity. There are many clustering algorithms, but none of them can solve all the clustering problems. Therefore, it is necessary to carefully study the characteristics of the problem to be solved in order to choose the appropriate algorithm. Various text clustering algorithms will be introduced later.

3. Cluster evaluation.

Select a document collection that has already been manually categorized or labeled as the test set. After clustering, compare the clustering results with the existing manual categories. Commonly used evaluation metrics include recall, precision and the F1 value.

1. Hierarchical clustering method

Hierarchical clustering comes in two flavors: agglomerative and divisive. The agglomerative approach starts with each text as its own cluster and repeatedly merges clusters until only one remains; the divisive approach works in the opposite direction. Hierarchical clustering yields a hierarchy of clusters, but its computational complexity is high and it cannot handle very large document collections.

2. Partitioning methods

K-means is the most commonly used partitioning method. Given k, k texts are chosen as the initial clusters, each remaining text is assigned to the nearest cluster, the cluster centers are updated, and the texts are re-assigned according to the new centers; the algorithm stops when the clusters no longer change or after a fixed number of iterations. K-means has low complexity and is easy to implement, but it is sensitive to outliers and noisy texts, and there is no good general way to choose k.
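A minimal k-means sketch with scikit-learn over TF-IDF vectors (the toy documents are illustrative). Note that KMeans uses Euclidean distance; since TfidfVectorizer L2-normalizes its output by default, Euclidean distance on these vectors ranks documents the same way as cosine similarity:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Illustrative toy documents (not from the original)
docs = ["the team won the match", "the striker scored a goal",
        "stocks fell sharply today", "the market rallied after the report"]

X = TfidfVectorizer().fit_transform(docs)                     # text representation (L2-normalized)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)   # k chosen by hand
print(km.labels_)                                             # cluster id of each document
```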

3. Density-based approach

In order to find clustering results with arbitrary shapes, a density-based method is proposed. This method regards clusters as high-density areas separated by low-density areas in data space. Common density-based methods include DBSCAN, OPTICS, DENCLUE, etc.

4. Neural network method

The neural network approach describes each cluster by an exemplar that serves as the cluster's "prototype" and does not necessarily correspond to an actual data point; new objects are assigned to the most similar cluster according to some distance measure. Well-known neural clustering algorithms include competitive learning and self-organizing maps [Kohonen, 1990]. Neural clustering methods require long processing times and scale poorly, so they are not well suited to clustering large datasets.