Text classification applications: common examples include spam identification and sentiment analysis.
Text classification directions: there are mainly two kinds of classification, multi-class classification and multi-label classification.
Text classification methods: traditional machine learning methods (Naive Bayes, SVM, etc.) and deep learning methods (fastText, TextCNN, etc.).
Outline of this article: this article mainly introduces the processing pipeline and main methods of text classification, aiming to help readers understand where to start when dealing with text classification, what problems to pay attention to, and which methods to adopt in different scenarios.
The text classification pipeline can be roughly divided into text preprocessing, text feature extraction, and classification model construction. Compared with English text processing and classification, preprocessing is the key technology for Chinese text.
Chinese word segmentation is a key technology in Chinese text classification. Word-granularity features are far better than character-granularity features: most classification algorithms do not consider word-order information, so character-granularity N-grams lose too much information. Chinese word segmentation techniques can be briefly summarized as: word segmentation based on string matching, word segmentation based on understanding, and word segmentation based on statistics [1].
1. Word segmentation method based on string matching:
Process: this is dictionary-based Chinese word segmentation. The core is to first build a unified dictionary. When a sentence needs to be segmented, it is first split into multiple parts, and each part is matched against the dictionary. If a word is in the dictionary, the segmentation succeeds; otherwise, splitting and matching continue until it succeeds.
Core: the dictionary, the segmentation rules, and the matching order.
Analysis: the advantages are high speed (time complexity can be kept at O(n)), simple implementation, and acceptable results; however, it does not handle ambiguity and out-of-vocabulary words well.
2. Understanding-based word segmentation method: this approach makes the computer simulate human understanding of a sentence in order to identify words. Its basic idea is to perform syntactic and semantic analysis while segmenting, and to use syntactic and semantic information to resolve ambiguity. It usually consists of three parts: a word segmentation subsystem, a syntactic-semantic subsystem, and a general control part. Under the coordination of the general control part, the segmentation subsystem obtains syntactic and semantic information about words and sentences to judge segmentation ambiguity, i.e., it simulates the process of human sentence understanding. This method requires a large amount of linguistic knowledge. Because of the generality and complexity of Chinese, it is difficult to organize all this linguistic information into a form that machines can read directly, so understanding-based segmentation systems are still at the experimental stage.
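As a concrete illustration, here is a minimal sketch of forward maximum matching, one common string-matching strategy; the toy dictionary and maximum word length are illustrative only.

```python
# A minimal sketch of dictionary-based segmentation using forward maximum matching.
def forward_max_match(sentence, dictionary, max_word_len=4):
    """Greedily match the longest dictionary word starting at each position."""
    words = []
    i = 0
    while i < len(sentence):
        matched = None
        # Try the longest candidate first, then shrink the window.
        for length in range(min(max_word_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if candidate in dictionary or length == 1:
                matched = candidate
                break
        words.append(matched)
        i += len(matched)
    return words

if __name__ == "__main__":
    toy_dict = {"南京", "南京市", "长江", "长江大桥", "大桥", "市长"}
    print(forward_max_match("南京市长江大桥", toy_dict))
    # -> ['南京市', '长江大桥'] with this toy dictionary
```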
3. Word segmentation method based on statistics:
Process: from a statistical perspective, word segmentation is a probability-maximization problem: among all possible splits of a sentence, choose the most probable one. Based on a corpus, the probability that adjacent characters form a word is counted; the more often adjacent characters co-occur, the higher that probability. Segmentation is therefore driven by these probability values, so a complete corpus is very important.
The main statistical models are: N-gram, Hidden Markov Model, Maximum Entropy Model (ME), Conditional Random Field Model (CRF) and so on.
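Below is a minimal sketch of segmentation as probability maximization, assuming a toy table of unigram word probabilities; a real system would estimate these from a large corpus and might use HMM or CRF models instead.

```python
import math

# Dynamic programming over unigram word probabilities (toy values, not from a real corpus).
word_prob = {"南京": 0.02, "南京市": 0.01, "市长": 0.015, "长江": 0.03,
             "长江大桥": 0.005, "大桥": 0.02, "市": 0.001}

def max_prob_segment(sentence, word_prob, max_word_len=4):
    n = len(sentence)
    best = [float("-inf")] * (n + 1)   # best log-probability of sentence[:i]
    best[0] = 0.0
    prev = [0] * (n + 1)               # backpointer: start index of the last word
    for i in range(1, n + 1):
        for j in range(max(0, i - max_word_len), i):
            word = sentence[j:i]
            # Unknown single characters get a small smoothing probability.
            p = word_prob.get(word, 1e-8 if len(word) == 1 else 0.0)
            if p > 0 and best[j] + math.log(p) > best[i]:
                best[i] = best[j] + math.log(p)
                prev[i] = j
    # Recover the segmentation from the backpointers.
    words, i = [], n
    while i > 0:
        words.append(sentence[prev[i]:i])
        i = prev[i]
    return list(reversed(words))

print(max_prob_segment("南京市长江大桥", word_prob))   # -> ['南京市', '长江大桥']
```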
1. Word segmentation: for Chinese tasks, word segmentation is essential; jieba is generally used and is the industry leader (see the preprocessing sketch after item 3).
2. Stop-word removal: build a stop-word dictionary. Commonly used stop-word lists contain roughly 2,000 words, mainly adverbs, adjectives, and conjunctions. Maintaining the stop-word list is itself a form of feature extraction and is essentially part of feature selection.
3. Part-of-speech tagging: determining the part of speech (verb, noun, adjective, adverb, etc.) of each word after segmentation, which can be obtained by setting parameters when using jieba.
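Here is a minimal preprocessing sketch covering the three steps above, assuming jieba is installed (`pip install jieba`) and using a tiny illustrative stop-word list.

```python
import jieba
import jieba.posseg as pseg

stop_words = {"的", "了", "和", "是", "在"}   # in practice, load a ~2,000-word list

text = "今天的天气真好，我们去公园散步吧"

# 1. Word segmentation (accurate mode).
tokens = jieba.lcut(text)

# 2. Stop-word removal.
tokens = [t for t in tokens if t not in stop_words and t.strip()]

# 3. Part-of-speech tagging: each item has .word and .flag (e.g. n=noun, v=verb).
tagged = [(pair.word, pair.flag) for pair in pseg.cut(text)]

print(tokens)
print(tagged)
```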
The core of text classification is how to extract key features that reflect the text from the text itself, and to capture the mapping between features and categories. Feature engineering is therefore very important and can be divided into the following parts:
1. Feature representation based on the bag-of-words model: a bag of words built from unigrams alone may reach tens of thousands of dimensions, and if bigrams and trigrams are also considered, the vocabulary may reach hundreds of thousands, so bag-of-words feature representations are usually extremely sparse.
(1) There are three common ways to represent bag-of-words features (see the sketch below):
(2) Advantages and disadvantages:
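To illustrate the representations mentioned in (1), here is a minimal scikit-learn sketch of three common bag-of-words variants: boolean presence, raw term counts, and TF-IDF weights. The toy documents are already segmented and space-joined.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["手机 很 好用 推荐", "手机 太 难用 不 推荐", "电池 续航 很 好"]

token_re = r"(?u)\b\w+\b"   # keep single-character Chinese tokens

binary_vec = CountVectorizer(binary=True, token_pattern=token_re)        # presence / absence
count_vec = CountVectorizer(ngram_range=(1, 2), token_pattern=token_re)  # counts, unigrams + bigrams
tfidf_vec = TfidfVectorizer(token_pattern=token_re)                      # TF-IDF weights

print(binary_vec.fit_transform(docs).toarray())
print(count_vec.fit_transform(docs).shape)    # dimensionality grows quickly with n-grams
print(tfidf_vec.fit_transform(docs).toarray().round(2))
```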
2. Feature representation based on embeddings: compute text features from word vectors (mainly for short texts).
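A minimal sketch of embedding-based features follows: represent a short text as the average of its word vectors. The tiny 4-dimensional vectors below are made up for illustration; in practice they would come from word2vec, GloVe, or fastText.

```python
import numpy as np

word_vectors = {
    "手机": np.array([0.2, 0.1, -0.3, 0.5]),
    "好用": np.array([0.4, -0.2, 0.1, 0.3]),
    "推荐": np.array([0.1, 0.0, 0.2, 0.4]),
}

def text_embedding(tokens, word_vectors, dim=4):
    """Average the vectors of known words; fall back to a zero vector."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

print(text_embedding(["手机", "好用", "推荐"], word_vectors))
```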
4. Features extracted based on the task itself: mainly designed for specific tasks. Through our observation and perception of data, we may be able to find some potentially useful features. Sometimes, these manual features greatly improve the final classification effect. For example, for the task of classifying positive and negative comments, the number of negative words is a strong one-dimensional feature for negative comments.
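A toy version of the hand-crafted feature just mentioned, assuming an illustrative negative-word list:

```python
# Count negative words in a segmented review as a one-dimensional feature.
negative_words = {"难用", "差", "失望", "退货", "不推荐"}

def negative_word_count(tokens):
    return sum(1 for t in tokens if t in negative_words)

print(negative_word_count(["手机", "太", "难用", "很", "失望"]))  # -> 2
```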
5. Feature fusion: when the feature dimensionality is high and the data pattern is complex, nonlinear models (such as the popular GBDT and XGBoost) are recommended; when the feature dimensionality is low and the data pattern is simple, a simple linear model (such as LR) is suggested.
6. Topic features:
LDA (document topics): assume the document collection has T topics, and a document may belong to one or more topics. With an LDA model, the probability that a document belongs to each topic can be computed, yielding a D×T matrix. LDA features perform well in tasks such as document tagging.
LSI (latent semantics of documents): the latent semantics of a document are computed by decomposing the document-term frequency matrix; similar to LDA, these are latent features of the document.
This part is not the focus. Any model usable for classification in traditional machine learning can be applied, such as the Naive Bayes (NB) model, random forest (RF), SVM, KNN, and neural network classifiers.
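A minimal gensim sketch of both topic features, assuming pre-segmented toy texts: LdaModel gives per-document topic probabilities (rows of the D×T matrix) and LsiModel gives latent-semantic coordinates.

```python
from gensim import corpora, models

texts = [["手机", "电池", "续航"],
         ["电影", "剧情", "演员"],
         ["手机", "屏幕", "电池"]]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)

# Each row is a document's topic probability distribution (D x T).
doc_topics = [lda.get_document_topics(bow, minimum_probability=0.0) for bow in corpus]
print(doc_topics[0])
print(lsi[corpus[0]])   # latent-semantic coordinates of the first document
```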
Bayesian model is emphasized here, because the industry uses this model to identify spam [2].
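As a concrete sketch of such a spam filter, here is a minimal scikit-learn pipeline with TF-IDF features and multinomial Naive Bayes, trained on a tiny illustrative dataset (not real spam data; texts are pre-segmented and space-joined).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["免费 中奖 点击 领取", "发票 代开 优惠", "明天 开会 请 准时", "周末 一起 吃饭"]
train_labels = ["spam", "spam", "ham", "ham"]

clf = make_pipeline(TfidfVectorizer(token_pattern=r"(?u)\b\w+\b"), MultinomialNB())
clf.fit(train_texts, train_labels)

print(clf.predict(["点击 领取 免费 发票"]))   # likely 'spam' with this toy data
```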
1. fastText model: fastText comes from a paper published in July 2016, shortly after Mikolov (the author of word2vec) moved to Facebook: Bag of Tricks for Efficient Text Classification [3].
Model structure:
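As a rough illustration of the fastText idea, here is a minimal numpy sketch: average the word (and n-gram) embeddings of a document, then apply a single linear layer with softmax over classes. The dimensions and random weights are placeholders, not the real model.

```python
import numpy as np

vocab_size, embed_dim, num_classes = 1000, 10, 3
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, embed_dim))   # embedding lookup table
W = rng.normal(size=(embed_dim, num_classes))  # linear classifier
b = np.zeros(num_classes)

def fasttext_forward(token_ids):
    doc_vec = E[token_ids].mean(axis=0)        # average of token embeddings
    logits = doc_vec @ W + b
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()                     # softmax over classes

print(fasttext_forward([3, 17, 256, 999]))
```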
Improvement: the attention mechanism is a commonly used way of modeling long-term dependencies in natural language processing; it can intuitively show each word's contribution to the result and has essentially become a standard component of Seq2Seq models. In fact, text classification can in a sense be understood as a special kind of Seq2Seq, so it is natural to introduce the attention mechanism here.
Process:
A forward and a backward RNN are used to obtain the forward and backward context representations of each word:
The representation of a word then becomes the concatenation of its word vector with the forward and backward context vectors;
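Below is a rough numpy sketch of the attention step described here: score each word's concatenated representation, normalize with softmax to obtain per-word contributions, and take the weighted sum as the text representation. Random arrays stand in for the learned parameters and the RNN outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, dim = 5, 8
H = rng.normal(size=(seq_len, dim))        # one row per word: [fwd ctx; word vec; bwd ctx]

W_att = rng.normal(size=(dim, dim))
u = rng.normal(size=dim)                   # learned context/query vector

scores = np.tanh(H @ W_att) @ u            # unnormalized score per word
weights = np.exp(scores - scores.max())
weights /= weights.sum()                   # attention weight = word's contribution

text_vec = weights @ H                     # weighted sum -> text representation
print(weights.round(3), text_vec.shape)
```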
Obviously, the model is not the most important thing: good model design is very important for getting good results and is the focus of academic attention, but in practice the model accounts for a relatively small share of the work. Although the second part introduces five CNN/RNN models and their variants, in practice CNN alone is enough to achieve very good results on Chinese text classification tasks. Our experiments show that RCNN improves accuracy by about 1%, which is not very significant. The best approach is to first use the TextCNN model to tune the overall task to its best, and then try to improve the model.
Know your data: although a big advantage of deep learning is that it no longer requires tedious and inefficient manual feature engineering, if you treat it purely as a black box you will often be left baffled. Be sure to understand your data, and remember that data sense is always very important, whether you use traditional methods or deep learning. Pay attention to bad-case analysis to understand whether your data is appropriate and why predictions are right or wrong.
Hyperparameter tuning: you can refer to the Zhihu column on tuning tricks for deep learning network parameters.
Be sure to use dropout: there are only two situations where you can skip it: the amount of data is extremely small, or you use a better regularization method such as batch normalization (BN). In practice we tried dropout with different rates and 0.5 worked best, so if your computing resources are limited, the default value of 0.5 is a good choice.
You don't have to use softmax loss: it depends on your data. If your classes are not mutually exclusive, try training multiple binary classifiers, i.e., define the problem as multi-label rather than multi-class. After this adjustment, accuracy improved by more than 1%.
Class imbalance: it is basically a conclusion verified in many scenarios that if your loss is dominated by certain classes, it is mostly negative for the whole. It is suggested to try a bootstrap-like method to adjust the sample weights in the loss.
Avoid training oscillation: by default, add a random sampling factor so that the data distribution is as close to i.i.d. as possible; the default shuffle mechanism can make the training results more stable. If the training model is still very unstable, consider adjusting the learning rate or mini_batch_size.
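A minimal scikit-learn sketch of that multi-label setup, assuming a one-vs-rest wrapper around logistic regression and a tiny illustrative dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.pipeline import make_pipeline

texts = ["屏幕 清晰 电池 耐用", "电池 不 耐用 发热", "屏幕 清晰 价格 便宜"]
labels = [["屏幕", "电池"], ["电池"], ["屏幕", "价格"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)              # one binary column per label

clf = make_pipeline(TfidfVectorizer(token_pattern=r"(?u)\b\w+\b"),
                    OneVsRestClassifier(LogisticRegression()))
clf.fit(texts, Y)                          # one binary classifier per label
print(mlb.inverse_transform(clf.predict(["电池 发热 屏幕 清晰"])))
```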
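One simple way to re-weight samples in the loss is class-frequency-based weighting; the sketch below uses scikit-learn's 'balanced' heuristic as an assumption, with bootstrap-style resampling being another option.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Highly imbalanced toy labels with random 2-d features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.array([0] * 95 + [1] * 5)

# 'balanced' weights are inversely proportional to class frequency.
print(compute_class_weight("balanced", classes=np.array([0, 1]), y=y))

clf = LogisticRegression(class_weight="balanced").fit(X, y)
print(clf.predict(X[:5]))
```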
Zhihu's multi-label text classification contest, write-ups by the first- and second-place teams:
Summary of champion of NLP contest: 3 million Zhihu multi-label text classification task (with deep learning source code)
2017 Zhihu Kanshan Cup: from entry to second place.