Paper reading notes
Reading Notes —— Lattice-Based Recurrent Neural Network Encoders for Neural Machine Translation (Su Jinsong et al.)

Abstract introduction:

NMT (neural machine translation) relies heavily on word-level modeling to learn semantic representations of input sentences.

For languages without natural word delimiters (such as Chinese), the input must first be tokenized, which gives rise to two problems:

1) It is difficult to determine an optimal tokenization granularity for modeling source sentences: coarse granularity causes data sparsity, while fine granularity loses useful information. 2) Tokenization is error-prone, and its errors propagate into the NMT encoder and degrade the source-sentence representations.

Given these two problems, NMT should be provided with multiple tokenizations rather than a single tokenized sequence in order to better model the source sentence.

This paper proposes word-lattice-based recurrent neural network (RNN) encoders for NMT: 1) they take as input a word lattice, a compact encoding of multiple tokenizations; 2) they learn to generate new hidden states from arbitrarily many inputs and hidden states of preceding time steps.

A lattice is a compressed representation of many tokenizations. The lattice-based encoders not only alleviate 1-best tokenization errors but are also more expressive and flexible than standard encoders when embedding input sentences.
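To make the lattice idea concrete, the following is a minimal sketch (not taken from the paper): a word lattice represented as edges over character positions, with a small hand-picked edge set as the assumed example, together with the tokenizations it compresses.

```python
# A minimal sketch (not from the paper) of a word lattice over character positions.
# Each edge (i, j, word) means characters i..j-1 form one candidate word; the edge
# set below is an assumption chosen to show a classic segmentation ambiguity.
sentence = "南京市长江大桥"
lattice = [
    (0, 2, "南京"), (0, 3, "南京市"), (2, 4, "市长"),
    (3, 7, "长江大桥"), (4, 7, "江大桥"),
]

def tokenizations(lattice, start, end):
    """Enumerate every tokenization compressed into the lattice between start and end."""
    if start == end:
        yield []
    for i, j, word in lattice:
        if i == start and j <= end:
            for rest in tokenizations(lattice, j, end):
                yield [word] + rest

for seg in tokenizations(lattice, 0, len(sentence)):
    print(" / ".join(seg))
# 南京 / 市长 / 江大桥
# 南京市 / 长江大桥
```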


NMT features:

Unlike traditional statistical machine translation, which models latent structures and correspondences between the source and target languages through a pipeline of separate components, NMT trains a unified encoder-decoder neural network, in which the encoder maps the input sentence to a fixed-length vector and the decoder generates the translation from that encoded vector.

Word-lattice-based RNN encoders for NMT:

The paper studies and compares two kinds of word-lattice-based RNN encoders:

1) Shallow word-lattice GRU encoder: it combines the inputs and hidden states coming from multiple tokenizations while keeping the standard GRU architecture.

2) Deep word-lattice GRU encoder: it learns tokenization-specific gate, input, and hidden-state vectors, and then generates the hidden-state vector of the current cell from them.

In both encoders, many different tokenizations are used simultaneously to model the input sentence.
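The following is a rough sketch, not the authors' implementation, of one step of the shallow variant: the candidate word embeddings and predecessor hidden states arriving at a lattice node are pooled element-wise (max pooling here is an assumption, standing in for whatever combination strategy the paper actually adopts) and then fed to a standard GRU cell.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """A plain GRU cell; the lattice-specific part lives only in shallow_lattice_step."""
    def __init__(self, input_size, hidden_size, seed=0):
        rng = np.random.default_rng(seed)
        shape_w, shape_u = (hidden_size, input_size), (hidden_size, hidden_size)
        self.Wz, self.Uz = rng.normal(0, 0.1, shape_w), rng.normal(0, 0.1, shape_u)
        self.Wr, self.Ur = rng.normal(0, 0.1, shape_w), rng.normal(0, 0.1, shape_u)
        self.Wh, self.Uh = rng.normal(0, 0.1, shape_w), rng.normal(0, 0.1, shape_u)

    def step(self, x, h_prev):
        z = sigmoid(self.Wz @ x + self.Uz @ h_prev)             # update gate
        r = sigmoid(self.Wr @ x + self.Ur @ h_prev)             # reset gate
        h_tilde = np.tanh(self.Wh @ x + self.Uh @ (r * h_prev))
        return (1.0 - z) * h_prev + z * h_tilde

def shallow_lattice_step(cell, candidates):
    """candidates: list of (word_embedding, predecessor_hidden) pairs, one per lattice
    edge ending at the current character position. They are pooled into a single pair
    before one standard GRU update."""
    xs = np.stack([x for x, _ in candidates])
    hs = np.stack([h for _, h in candidates])
    x_pool, h_pool = xs.max(axis=0), hs.max(axis=0)             # element-wise max pooling
    return cell.step(x_pool, h_pool)

cell = GRUCell(input_size=4, hidden_size=3)
cands = [(np.ones(4), np.zeros(3)), (0.5 * np.ones(4), 0.2 * np.ones(3))]
print(shallow_lattice_step(cell, cands))
```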

Conclusion:

Compared with the standard RNN encoder, the proposed encoders exploit multiple tokenizations of both the inputs and the previous hidden states when modeling the source sentence. They therefore not only alleviate error propagation from 1-best tokenization but are also more expressive and flexible than the standard encoder.

Experimental results on Chinese-English translation show that the proposed encoders achieve significant improvements over various baselines.

Outlook:

The network structure in this paper depends on the word lattice of the source sentence. Future work: extend the model by integrating the segmentation model into source-sentence representation learning, so that segmentation and translation can benefit from each other, and adopt better combination strategies to improve the encoders.

Verification experiment:

To verify the effectiveness of the encoders, experiments were carried out on Chinese-English translation tasks.

The experimental results show that:

(1) Word-boundary information is necessary for learning accurate embeddings of Chinese sentences.

(2) The word-lattice-based RNN encoders outperform the standard RNN encoder for NMT. To the authors' knowledge, this is the first attempt to build NMT on word lattices.

Experimental part:

1. Datasets

The proposed encoders are evaluated on the NIST Chinese-English translation task.

Training data: 1.25 million sentence pairs extracted from LDC2002E18, LDC2003E14, LDC2004T07, and LDC2005T06, containing 27.9 million Chinese words and 34.5 million English words.

Validation data: the NIST 2005 dataset.

Test data: the NIST 2002, 2003, 2004, 2006, and 2008 datasets.

Word segmenters are trained with the word segmentation toolkit released by Stanford University on the CTB, PKU (Peking University), and MSR corpora, and their outputs are used to build word lattices for the Chinese sentences.

To train the neural networks efficiently, the 50K most frequent words in Chinese and English are used as the vocabularies. The Chinese vocabulary covers 98.5%, 98.6%, 99.3%, and 97.3% of the words in the CTB, PKU, MSR, and lattice corpora respectively, and the English vocabulary covers 99.7%.
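As a small illustration of how such a frequency-cutoff vocabulary and its coverage figures are obtained, here is a sketch using a toy stand-in corpus (the corpus and the tiny size parameter are assumptions for demonstration, not the paper's data).

```python
from collections import Counter

def build_vocab(tokens, size=50_000):
    """Keep the `size` most frequent token types."""
    return {w for w, _ in Counter(tokens).most_common(size)}

def coverage(tokens, vocab):
    """Fraction of running tokens covered by the vocabulary."""
    return sum(t in vocab for t in tokens) / len(tokens)

corpus = ["我们", "使用", "词格", "编码器", "我们", "翻译"]   # toy stand-in corpus
vocab = build_vocab(corpus, size=5)
print(f"coverage = {coverage(corpus, vocab):.1%}")
```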

2. Experimental results:

Character coverage:

Translation quality:

NMT decoding experiments using 1-best word segmentation.

Model:

Word lattice

The lattice-based model is not tied to any single word segmentation, yet it exploits word information more effectively, because it can freely choose among words according to the context to resolve ambiguity.

Two kinds of word-lattice-based RNN encoders

Reading Notes —— A Named Entity Recognition Method Based on BLSTM (Feng et al.)

Abstract introduction:

Problems addressed: (1) supervised training corpora are insufficient; (2) RNNs cannot handle long-distance dependencies well, and their training suffers from vanishing or exploding gradients.

The method is based on three considerations: (1) whether a piece of text is recognized as a named entity depends on its context as well as on each character/word and the word order that make up the entity; (2) by taking into account the dependencies between tags in the label sequence, the model's cost function is constrained, so that as much valuable information as possible can be mined from small training data to improve named entity recognition; (3) hand-crafted features and domain knowledge strongly influence the recognition of named entities in traditional methods, but designing such features and acquiring domain knowledge are expensive.

Therefore, this paper proposes an effective neural-network-based method for named entity recognition. The method does not rely directly on hand-crafted features or external resources; it uses only a small amount of supervised data, domain knowledge, and a large amount of unlabeled data, addressing the problems that current machine learning methods depend too heavily on hand-crafted features and domain knowledge and suffer from insufficient corpora. The proposed method integrates the context information of words, their prefix and suffix information, and a domain dictionary, and encodes this information as distributed representation features of words. By further modeling the constraints between word tags, the recognition performance is improved.
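One common way to fold tag constraints into the cost function is to add tag-transition scores to the per-position scores, as in CRF-style sequence scoring; the sketch below illustrates that general idea and is an assumption, not the paper's exact formulation.

```python
import numpy as np

def sequence_score(emissions, transitions, tags):
    """emissions: (T, K) per-token tag scores (e.g., from a BLSTM);
    transitions: (K, K) scores for moving from one tag to the next;
    tags: length-T list of tag indices. Higher score = more plausible sequence."""
    score = emissions[0, tags[0]]
    for t in range(1, len(tags)):
        score += transitions[tags[t - 1], tags[t]] + emissions[t, tags[t]]
    return score

rng = np.random.default_rng(0)
emissions = rng.normal(size=(5, 4))     # 5 characters, 4 tags (e.g., B / M / E / O)
transitions = rng.normal(size=(4, 4))
print(sequence_score(emissions, transitions, [0, 1, 2, 3, 3]))
```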

Outlook: this paper reads the data sequentially to recognize named entities, and every word is given the same influence on a named entity, without considering that different words contribute differently. How to introduce a deep-learning attention mechanism into this model, so that it focuses on the words that matter most for recognizing named entities, is a problem for future work.

Experimental part:

Datasets:

DataSet1 (large-scale unlabeled corpus), DataSet2 (labeled corpus), DataSet3 (corpus labeled for named entity recognition).

DataSet4 (obtained by removing the labels from DataSet2 and DataSet3 and splitting the text into character sequences).

DataSet5 (data selected from the Sogou input-method lexicon, including common Chinese personal names, place names, and names of state organs and organizations, split into character sequences).

Sample classification: TP, FP, TN, FN.

Evaluation metrics: precision (P), recall (R), F-score (F), sensitivity (Sens), specificity (Spec), 1-specificity (1-Spec), and accuracy (Acc).
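For reference, a small sketch of how these metrics are computed from the four counts above; the example counts are made up for illustration.

```python
def metrics(tp, fp, tn, fn):
    """Standard definitions computed from the four counts for one entity type."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0            # recall, also called sensitivity
    f = 2 * p * r / (p + r) if p + r else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    acc = (tp + tn) / (tp + fp + tn + fn)
    return {"P": p, "R": r, "F": f, "Sens": r, "Spec": spec,
            "1-Spec": 1 - spec, "Acc": acc}

print(metrics(tp=80, fp=10, tn=900, fn=20))
```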

Experimental results:

Experimental influencing factors:

Place names and organization names are usually longer than person names and their composition is more complex. Context-based word vectors and the character vectors trained by the BLSTM_Ec model have a positive impact on their recognition.

Person names are short, there is no strong binding between surnames and given names, and names in the name dictionary correlate only weakly with the person-name entities in the text to be recognized. Therefore, prefix/suffix information, tag-constraint information, and domain knowledge do influence person-name entities, but only slightly.

Model:

where Ec is the character-level vector and Ew is the context-based word vector.
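A minimal illustration, not the authors' code, of combining the two vectors into the per-token input of the word-level BLSTM; the concatenation and the vector sizes are assumptions made for the sketch.

```python
import numpy as np

def token_representation(ec, ew):
    """ec: character-level vector of the token; ew: context-based word vector."""
    return np.concatenate([ec, ew])

ec = np.random.rand(50)     # e.g., produced by the character-level BLSTM (BLSTM_Ec)
ew = np.random.rand(100)    # e.g., a word2vec-style vector trained on unlabeled text
x = token_representation(ec, ew)
print(x.shape)              # (150,) -> the per-token input to the word-level BLSTM
```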

Reading Notes —— An Empirical Study of Automatic Chinese Word Segmentation for Spoken Language Understanding and Named Entity Recognition (Luo et al.)

Background: in English text, sentences are sequences of words separated by spaces, whereas Chinese sentences are character strings without natural delimiters (other similar languages: Arabic, Japanese). The first step of a Chinese processing task is therefore to identify the word sequence in a sentence and mark boundaries at appropriate positions. Word segmentation of Chinese text can eliminate ambiguity to some extent. Segmentation is usually treated as the first step of many Chinese natural language processing tasks, but its influence on those downstream tasks has been relatively little studied.

Abstract introduction:

Two main problems at present: 1) domain mismatch when applying an existing word segmenter to new data; 2) whether a better segmenter actually yields better performance on downstream NLP tasks.

To address these problems, the paper proposes three methods: 1) use the segmentation output as additional features in the downstream task (sketched below), which is more robust to error propagation than using the segmented units directly; 2) use partially labeled data derived from the training data of the downstream task to adapt the existing segmenter and further improve end-to-end performance; 3) use the n-best list output by the segmenter, which makes the downstream task less sensitive to segmentation errors.
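The first method can be illustrated with per-character boundary tags derived from a segmentation; the sketch below assumes the common BMES scheme, which the paper may or may not use in exactly this form.

```python
def bmes_features(segmented_words):
    """Turn a segmented sentence into per-character boundary tags."""
    feats = []
    for word in segmented_words:
        if len(word) == 1:
            feats.append((word, "S"))                        # single-character word
        else:
            feats.append((word[0], "B"))                     # beginning of a word
            feats.extend((c, "M") for c in word[1:-1])       # middle characters
            feats.append((word[-1], "E"))                    # end of a word
    return feats

print(bmes_features(["北京", "是", "中国", "的", "首都"]))
```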

The main task of Chinese word segmentation: 1) identify the word sequence in a sentence; 2) mark boundaries at appropriate positions.

Summary:

The paper puts forward three methods: using segmentation output as additional features; adapting the segmenter with partially labeled data; and using the n-best list.

In addition, the influence of CWS is studied in three different situations: 1) when the in-domain data has no word-boundary information, end-to-end performance can be improved by using publicly available out-of-domain segmentation data, and further improved by adapting the segmenter with a small amount of manually derived partially labeled data; marginalizing over n-best segmentations brings yet further improvement. 2) When in-domain segmentation is available, a segmenter trained on the in-domain data itself achieves better CWS performance but does not necessarily give better end-to-end task performance; better end-to-end performance is obtained with a segmenter that behaves more consistently on the training and test data. 3) When the test data is segmented manually, word segmentation genuinely helps the task, since it reduces ambiguity for the downstream NLP task.
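As a rough illustration of marginalizing over an n-best list, the sketch below weights each candidate segmentation's features by its normalized segmenter score; this softmax weighting is an assumption, not necessarily the paper's exact scheme.

```python
import numpy as np

def marginalize_nbest(features, scores):
    """features: (n, d) array, one feature vector per candidate segmentation;
    scores: length-n array of segmenter log-scores. Returns the score-weighted
    (softmax) expectation of the features over the n-best list."""
    weights = np.exp(scores - np.max(scores))
    weights /= weights.sum()
    return weights @ features

feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])      # toy per-candidate features
print(marginalize_nbest(feats, np.array([-1.2, -2.5, -3.0])))
```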

Possible future directions: stack two CRF layers, one for word segmentation and one for the downstream task; and explore more downstream tasks beyond sequence labeling.

Experiment (NER part):

For the NER data used, both the in-domain training and test data have word-boundary information. The paper examines the difference between segmenters trained on in-domain data and on publicly available data (the second situation above), as well as the relationship between segmentation scores and end-to-end downstream performance.

Experimental data: the benchmark NER data from the third SIGHAN Chinese Language Processing Bakeoff (SIGHAN-3) (Levow, 2006). Training set: 46,364 sentences; test set: 4,365 sentences. The data are annotated with both word boundaries and NER information.

Experimental results: