This paper introduces a word vector model. It is not a text classification model, but it can be regarded as the basis of fastText, so I will mention it briefly.
The author argues that CBOW, skip-gram and most other word vector models do not consider the morphology of a word; they simply treat the different forms of a word as independent words. For example, 'like' has the forms 'likes', 'liked' and 'liking'. The meanings of these words are essentially the same, but the CBOW/skip-gram models treat them as independent words and do not consider their morphological relationship.
Therefore, the author proposes a character n-gram word vector model that makes effective use of subword (character-level) information, implemented in the skip-gram framework. For example, with n = 3 the word "where" is represented by the n-grams <wh, whe, her, ere, re>, plus the special sequence <where> for the whole word.
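As a quick illustration (my own sketch, not the paper's code), the character n-grams with boundary markers can be extracted like this:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word with boundary markers < and >."""
    w = "<" + word + ">"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(w) - n + 1):
            grams.add(w[i:i + n])
    grams.add(w)  # the whole word (with brackets) is also kept as a feature
    return grams

print(sorted(char_ngrams("where", 3, 3)))
# ['<wh', '<where>', 'ere', 'her', 're>', 'whe']
```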
For the loss, the paper adopts negative sampling with a binary logistic loss: each (target, context) pair is treated as a positive example, and randomly sampled words serve as negative examples.
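Written out (my reconstruction of the standard negative-sampling objective the paper uses), the loss for one target word $w_t$ and one context word $w_c$ with sampled negatives $\mathcal{N}_{t,c}$ is

$$\log\bigl(1 + e^{-s(w_t, w_c)}\bigr) + \sum_{n \in \mathcal{N}_{t,c}} \log\bigl(1 + e^{\,s(w_t, n)}\bigr),$$

where in the subword model the score is $s(w, c) = \sum_{g \in \mathcal{G}_w} \mathbf{z}_g^{\top} \mathbf{v}_c$, i.e. the sum of the n-gram vectors of $w$ dotted with the context vector of $c$.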
This paper proposes a neural-network text classification model (fastText) that is based on CBOW and very similar to it.
Like CBOW, the fastText model has only three layers: an input layer, a hidden layer and an output layer (hierarchical softmax). The input is a number of words represented as vectors, the hidden layer is simply the average of those word vectors, and the output is a specific target. The differences are: the input of CBOW is the context of a target word, while the input of fastText is the embedded representation of a document's words together with their n-gram features, which jointly represent that single document; the input words of CBOW are one-hot encoded, while fastText feeds embeddings; and the output of CBOW is the target word, while the output of fastText is the category label of the document. The output layer is likewise implemented with hierarchical softmax; of course, if you implement it yourself, you can use a plain softmax for tasks with a small number of categories.
Finally, a simplified Keras version of the fastText model is given.
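A minimal sketch of what such a Keras model might look like (vocabulary size, sequence length, embedding dimension and number of classes are all assumed values, and a plain softmax is used instead of hierarchical softmax):

```python
from tensorflow import keras
from tensorflow.keras import layers

vocab_size = 20000   # words plus hashed n-gram buckets (assumed)
max_len = 400        # padded document length (assumed)
num_classes = 5      # number of labels (assumed)

model = keras.Sequential([
    layers.Embedding(vocab_size, 100),                 # embedding lookup for words / n-grams
    layers.GlobalAveragePooling1D(),                   # the hidden layer: average of the embeddings
    layers.Dense(num_classes, activation="softmax"),   # plain softmax, fine for few classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(padded_id_sequences, labels, ...) would then train it
```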
Building on word vector representations, this paper proposes a convolutional neural network (TextCNN) to classify texts. The architecture is shown in the figure above.
In this paper, the author tries several different word vector settings: CNN-rand (randomly initialized embeddings trained from scratch), CNN-static (fixed pre-trained word2vec vectors), CNN-non-static (pre-trained vectors fine-tuned during training), and CNN-multichannel (a static channel and a non-static channel combined). A minimal Keras sketch of the network follows.
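The sketch below is my own simplification (the sizes are assumptions; the 3/4/5 filter widths, 100 feature maps per width and dropout of 0.5 follow the setting commonly reported for the paper):

```python
from tensorflow import keras
from tensorflow.keras import layers

max_len, vocab_size, embed_dim, num_classes = 100, 20000, 300, 2   # assumed sizes

inputs = keras.Input(shape=(max_len,), dtype="int32")
x = layers.Embedding(vocab_size, embed_dim)(inputs)  # could be initialized with word2vec weights
pooled = []
for k in (3, 4, 5):                                  # one branch per filter width
    c = layers.Conv1D(100, k, activation="relu")(x)  # 100 feature maps of width k
    pooled.append(layers.GlobalMaxPooling1D()(c))    # max-over-time pooling
x = layers.Concatenate()(pooled)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(num_classes, activation="softmax")(x)
model = keras.Model(inputs, outputs)
```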
In the previous article, the input to the CNN was generally a pre-trained word vector. In this paper, the author instead trains the embedding jointly with the classification task and proposes a model that can effectively extract and retain word-order information, i.e. that effectively learns n-grams; it can also be understood as learning the embedding through the CNN itself.
In addition, another problem is the varying length of the input sequence (handled by padding in the previous TextCNN article?). Here the author proposes a dynamic (variable) pooling layer that makes the output of the convolution layer the same size regardless of input length; this variable pooling is similar in spirit to spatial pyramid pooling in image recognition. A small sketch of the idea follows.
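This is my own numpy illustration, not the paper's code: split the convolution output into a fixed number of chunks and max-pool each chunk, so documents of different lengths yield the same output size.

```python
import numpy as np

def dynamic_max_pool(feature_map, out_len):
    """feature_map: (time, channels) conv output; returns (out_len, channels)."""
    t = feature_map.shape[0]
    bounds = np.linspace(0, t, out_len + 1, dtype=int)
    chunks = [feature_map[bounds[i]:max(bounds[i] + 1, bounds[i + 1])]  # never empty
              for i in range(out_len)]
    return np.stack([c.max(axis=0) for c in chunks])

short = np.random.randn(17, 64)    # conv output of a short document
long_ = np.random.randn(233, 64)   # conv output of a long document
print(dynamic_max_pool(short, 10).shape, dynamic_max_pool(long_, 10).shape)  # (10, 64) (10, 64)
```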
This article feels like a combination of fastText and TextCNN: n-gram embeddings are trained jointly with the classification task, and the embedding is learned through the CNN.
Text Classification by Region Embedding
In this paper, the author proposes tv-embedding (i.e. two-view embedding), which also belongs to region embedding (and can also be understood as n-gram embedding). The method is similar to the bow-CNN representation above: a bag-of-words (BoW) representation is used for the words and phrases in a region, which is then used to predict the regions before and after it (the words or phrases in its left and right neighborhoods); that is, the input region is view 1 and the target region is view 2. The tv-embedding is trained separately and then combined with the embedding in the CNN (forming multiple channels?). The author argues that embedding vectors pre-trained with word2vec are generic, whereas a tv-embedding trained on the dataset of a specific task carries some task-related information, which is more helpful for improving the model.
I don't understand this article very well; maybe my English is too poor. The author does not give a clear network diagram like the TextCNN diagram, which is clear at a glance: take one look and you know how to implement it.
This paper proposes a text classification model based on LSTM, using both supervised learning and semi-supervised pre-training. The author is the same as above, so many of the techniques used can be said to be the same; I will just briefly mention some ideas of this article.
The author argues that the existing approach of using an LSTM directly as a text classification model, feeding the last output of the LSTM to a fully connected classifier, faces two problems: (1) such models usually include a word embedding layer (i.e. the one-hot input passes through an embedding layer before entering the LSTM), but this embedding is unstable and hard to train; (2) representing the whole document with the last output of the LSTM is inaccurate. In general, the words near the end of the LSTM input carry a large weight in the final output, and this is not always appropriate when representing a whole article. The author therefore improves on these two points.
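For the second point, one natural remedy (my sketch, not necessarily the paper's exact construction) is to pool over all LSTM outputs rather than keeping only the last one:

```python
from tensorflow import keras
from tensorflow.keras import layers

max_len, vocab_size, num_classes = 200, 30000, 4   # assumed sizes

inputs = keras.Input(shape=(max_len,), dtype="int32")
x = layers.Embedding(vocab_size, 128)(inputs)
x = layers.LSTM(128, return_sequences=True)(x)     # keep the output at every time step
x = layers.GlobalMaxPooling1D()(x)                 # pool over time instead of using only the last state
outputs = layers.Dense(num_classes, activation="softmax")(x)
model = keras.Model(inputs, outputs)
```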
In fact, this paper can be seen as integrating the author's earlier tv-embedding semi-supervised training with an RCNN-style model. It somewhat gives the feeling of "ferocious moves, but 0-5 on the scoreboard", because the experimental results do not improve much over an ordinary CNN.
The author of this article also wrote the first two articles above on text classification with CNNs, so this paper combines some of the methods proposed there and uses a deep convolutional neural network (DPCNN). Specific details include: region embedding as the input representation, convolution blocks with a fixed number of feature maps, shortcut (residual) connections, and repeated downsampling by stride-2 pooling, which gives the network its "deep pyramid" shape.
For more details about DPCNN, please see the article on the deep word-level text classification model DPCNN.
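As a rough illustration of a DPCNN-style block (my reading of the architecture; filter counts, depth and sizes are assumptions): downsample with stride-2 pooling, apply two equal-size convolutions, and add a shortcut connection.

```python
from tensorflow import keras
from tensorflow.keras import layers

def dpcnn_block(x, filters=250):
    x = layers.MaxPooling1D(pool_size=3, strides=2, padding="same")(x)  # halve the sequence length
    shortcut = x
    h = layers.Conv1D(filters, 3, padding="same", activation="relu")(x)
    h = layers.Conv1D(filters, 3, padding="same", activation="relu")(h)
    return layers.Add()([shortcut, h])                                  # shortcut (residual) connection

inputs = keras.Input(shape=(512,), dtype="int32")       # assumed document length
x = layers.Embedding(30000, 250)(inputs)                # stands in for the region embedding
x = layers.Conv1D(250, 3, padding="same")(x)
for _ in range(3):                                      # each block halves the length -> the "pyramid"
    x = dpcnn_block(x)
x = layers.GlobalMaxPooling1D()(x)
outputs = layers.Dense(2, activation="softmax")(x)
model = keras.Model(inputs, outputs)
```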
This paper proposes a text classification model based on CNN plus attention. The author argues that most existing CNN-based text classification models use convolution kernels of fixed size, so the learned representations are fixed n-gram representations, where n is tied to the filter size. In the semantic representation of a sentence, the n-grams that play an important role often differ from sentence to sentence, i.e. they vary. It is therefore important for the model to adaptively select the best n-grams for each sentence in order to improve its semantic representation ability. Based on this idea, the paper proposes a model that adaptively chooses among different n-gram representations.
The backbone of the model borrows DenseNet from computer vision and uses DenseNet-style dense connections to extract rich n-gram feature representations. For example, the model can learn not only f(x1, x2, x3) but also multi-level, richer features such as f(x1, f(x2, x3)). The network consists of three parts: the DenseCNN backbone, the attention module, and the final fully connected classification layer. A rough sketch of the dense-connection idea is given below.
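In this sketch (my own simplification; sizes and depth are made up, and plain max pooling stands in for the paper's attention module), every convolution sees the concatenation of all earlier feature maps, so deeper layers compose small n-gram features into larger ones.

```python
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(50,), dtype="int32")        # short, padded input (e.g. 50 tokens)
x = layers.Embedding(30000, 128)(inputs)
features = [x]
for _ in range(4):                                      # each layer reuses all previous feature maps
    inp = features[0] if len(features) == 1 else layers.Concatenate()(features)
    features.append(layers.Conv1D(64, 3, padding="same", activation="relu")(inp))
x = layers.Concatenate()(features)
x = layers.GlobalMaxPooling1D()(x)                      # stand-in for the attention module
outputs = layers.Dense(2, activation="softmax")(x)
model = keras.Model(inputs, outputs)
```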
This paper uses dense connections and attention to automatically obtain the n-gram features that matter most for the text's semantics, and the results are very good. The drawback is that the network is better suited to short texts: the input text is padded, with a maximum length of 50 or 100 depending on the dataset, which is clearly not enough for long texts. For long texts, it would be better to borrow from HAN and not limit the input length.
This paper proposes a text classification method (RCNN) that combines a recurrent neural network (RNN) and a convolutional neural network. Its structure is shown in the figure above, and the network can be divided into three parts: a bidirectional recurrent layer that produces a left context and a right context for each word, which are concatenated with the word's embedding and passed through a tanh projection; a max-pooling layer over all word positions; and a final fully connected softmax output layer. A rough Keras sketch of these three parts follows.
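In the sketch (sizes are assumptions, and SimpleRNN stands in for the paper's recurrent unit), the forward and backward RNNs provide the left and right contexts, which are concatenated with the word embedding, projected with tanh, and max-pooled over time.

```python
from tensorflow import keras
from tensorflow.keras import layers

max_len, vocab_size, num_classes = 100, 30000, 2        # assumed sizes

inputs = keras.Input(shape=(max_len,), dtype="int32")
emb = layers.Embedding(vocab_size, 128)(inputs)
left = layers.SimpleRNN(128, return_sequences=True)(emb)                 # left context, read left-to-right
right = layers.SimpleRNN(128, return_sequences=True, go_backwards=True)(emb)
right = layers.Lambda(lambda t: t[:, ::-1, :])(right)                    # re-align to left-to-right order
x = layers.Concatenate()([left, emb, right])                             # [left context; embedding; right context]
x = layers.Dense(128, activation="tanh")(x)                              # per-word latent semantic vector
x = layers.GlobalMaxPooling1D()(x)                                       # max-over-time pooling (the "CNN" part)
outputs = layers.Dense(num_classes, activation="softmax")(x)
model = keras.Model(inputs, outputs)
```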
Although it is billed as a combination of RNN and CNN, it actually only uses the pooling operation from CNN, which is a bit of a gimmick. The paper also discusses why RCNN captures context better than CNN: CNN uses a fixed-size window (the kernel size) to extract context information, which is essentially an n-gram, so its performance is strongly affected by the window size. If the window is too small, long-distance information is lost; if it is too large, the feature space becomes sparse and the amount of computation increases.
In many natural language processing tasks, a prominent problem is that training data are scarce and labeling is difficult. This paper therefore proposes a multi-task RNN framework that trains the parameters of a single model on several different task datasets, effectively enlarging the training data.
The author proposes three models, as shown in the figure above: Model I (uniform-layer architecture), where all tasks share a single LSTM layer and each task keeps its own embedding; Model II (coupled-layer architecture), where each task has its own LSTM layer but the hidden states of the two layers can read from each other; and Model III (shared-layer architecture), where in addition to each task's own LSTM there is a shared bidirectional LSTM whose output all tasks can use.
The three models are trained in the same way: at each step a task is selected at random, a training sample (or batch) is drawn from that task, and the shared parameters together with that task's own parameters are updated.
This paper proposes a hierarchical LSTM + attention model. The author observes that although an article consists of several sentences, only some of them may really play a key role, so an attention mechanism is applied so that the sentences contributing more to the article's semantics get a greater weight. Similarly, a sentence consists of many words, but only a few of them may be important, so using attention to let the important words play a larger role is the core idea of this article. The network can be divided into three layers: two LSTM layers for word encoding and sentence encoding respectively, with a fully connected classification layer on top. Counting the two attention layers, you can think of the network as having five layers: a word encoder (bidirectional LSTM over the words of each sentence), word-level attention that aggregates the word states into a sentence vector, a sentence encoder (bidirectional LSTM over the sentence vectors), sentence-level attention that aggregates them into a document vector, and the final fully connected softmax classifier. A small sketch of the attention pooling is given below.
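The numpy sketch below (shapes and parameters are illustrative) projects each hidden state, scores it against a learned context vector, softmaxes the scores, and takes the weighted sum; the same pooling is applied at the word level and at the sentence level.

```python
import numpy as np

def attention_pool(h, W, b, u):
    """h: (steps, dim) hidden states; W, b, u: attention parameters."""
    v = np.tanh(h @ W + b)                  # project each hidden state
    scores = v @ u                          # similarity to the learned context vector u
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                # softmax over time steps
    return weights @ h                      # weighted sum -> one vector (sentence or document)

h = np.random.randn(12, 64)                 # e.g. 12 word-level hidden states of one sentence
W, b, u = np.random.randn(64, 64), np.zeros(64), np.random.randn(64)
print(attention_pool(h, W, b, u).shape)     # (64,)
```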
Generally speaking, this article is very interesting and matches the way people actually read and write: we also have central words and central sentences when writing articles. However, whether this hierarchical structure leads to slow training or worse results is unknown. Finally, the paper also proposes sorting documents by length and putting documents of similar length in the same batch, which speeds up training by a factor of three.
This paper proposes a text classification method based on a graph neural network (Text GCN). The main idea is to put all documents and their vocabulary into one graph. There are two kinds of nodes in the graph: word nodes and document nodes. The weight of an edge connecting a word node and a document node is its TF-IDF value, while the weight of an edge between two word nodes is their pointwise mutual information (PMI). PMI is very similar to the conditional-probability computation in a traditional language model: PMI uses a sliding window, while the conditional probability is counted directly over the whole corpus, which can be regarded as one big window; apart from that they are the same. A small sketch of the sliding-window PMI follows.
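In this sketch (the window size and the toy corpus are made up), PMI(i, j) = log(p(i, j) / (p(i) p(j))) is estimated from window counts, and only positive values would be kept as word-word edges.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_scores(docs, window=5):
    """PMI estimated from sliding-window co-occurrence counts."""
    word_count, pair_count, n_windows = Counter(), Counter(), 0
    for doc in docs:
        for k in range(max(1, len(doc) - window + 1)):
            win = set(doc[k:k + window])
            n_windows += 1
            word_count.update(win)
            pair_count.update(frozenset(p) for p in combinations(sorted(win), 2))
    scores = {}
    for pair, c in pair_count.items():
        i, j = sorted(pair)
        pmi = math.log(c * n_windows / (word_count[i] * word_count[j]))
        if pmi > 0:                          # only positive PMI becomes a word-word edge
            scores[(i, j)] = pmi
    return scores

docs = [["graph", "network", "text", "classification"], ["graph", "network"]]
print(pmi_scores(docs, window=2))
```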
A denotes the adjacency matrix of the graph, defined as follows:
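Reconstructing the definition from the Text GCN paper as best I recall, it is roughly

$$A_{ij} = \begin{cases} \operatorname{PMI}(i, j) & \text{$i$, $j$ are words and } \operatorname{PMI}(i, j) > 0 \\ \text{TF-IDF}_{ij} & \text{$i$ is a document, $j$ is a word} \\ 1 & i = j \\ 0 & \text{otherwise} \end{cases}$$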
A GCN can contain multiple hidden layers, and each layer is computed as

$$L^{(j+1)} = \rho\bigl(\tilde{A}\, L^{(j)}\, W_j\bigr),$$

where $\tilde{A}$ is the normalized symmetric adjacency matrix, $W_j$ is the weight matrix of layer $j$ (for the first layer, $W_0 \in \mathbb{R}^{m \times k}$), and $\rho$ is an activation function such as ReLU, $\rho(x) = \max(0, x)$. As described above, higher-order neighborhood information can be merged by stacking multiple GCN layers, with $j$ denoting the layer index.
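To make the propagation rule concrete, here is a toy two-layer forward pass in numpy (the graph, feature sizes and weights are all made up; the softmax at the end is omitted):

```python
import numpy as np

n, hidden, classes = 5, 8, 3
A = np.eye(n) + (np.random.rand(n, n) > 0.7)       # random adjacency with self-loops
A = np.maximum(A, A.T)                             # keep it symmetric
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
A_tilde = D_inv_sqrt @ A @ D_inv_sqrt              # normalized symmetric adjacency
X = np.eye(n)                                      # one-hot node features, as in Text GCN
W0, W1 = np.random.randn(n, hidden), np.random.randn(hidden, classes)

L1 = np.maximum(0, A_tilde @ X @ W0)               # layer 1: ReLU(A~ X W0)
Z = A_tilde @ L1 @ W1                              # layer 2 scores, one row of class scores per node
print(Z.shape)                                     # (5, 3)
```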
The loss function is defined as the cross-entropy error over all labeled documents:
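As best I recall from the paper, it is

$$\mathcal{L} = - \sum_{d \in \mathcal{Y}_D} \sum_{f=1}^{F} Y_{df} \, \ln Z_{df},$$

where $\mathcal{Y}_D$ is the set of labeled document indices, $F$ is the number of classes, $Y$ is the label indicator matrix, and $Z$ is the softmax output of the network.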
Text GCN works well for two reasons: (1) the text graph captures both the document-word relations and the global word-word co-occurrence information; (2) label information from the small number of labeled documents propagates through the graph to word nodes and then on to other documents.
But it also has some disadvantages: the model is transductive, so classifying a new document requires rebuilding the graph and retraining, and building the graph discards word-order information.
Generally speaking, the idea of the paper is quite interesting and the results are decent. If this is your first encounter with GCN, it may still be a bit hard to understand; you can refer to the following materials for further study:
Text classification algorithm based on graph convolutional networks
How to understand Graph Convolutional Networks (GCN)?