BERT: Interpretation of the Paper
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

○ There are usually two strategies for applying a pre-trained language model to downstream tasks: feature-based (e.g., ELMo) and fine-tuning (e.g., OpenAI GPT).

The authors argue that the main bottleneck of current pre-trained language models is their unidirectionality. For example, GPT uses a left-to-right architecture in which each token can only attend to the tokens before it; this is sub-optimal for sentence-level tasks and especially harmful for token-level tasks. Question answering, for instance, depends heavily on incorporating context from both directions.

BERT uses a Masked Language Model (MLM) objective, inspired by the cloze task, to alleviate the unidirectionality constraint of earlier models. MLM randomly masks some tokens in the input text and then predicts the masked tokens from the remaining context. In addition to the masked language model, the authors also propose a next sentence prediction task to jointly pre-train text-pair representations.

The contributions of BERT in this paper are as follows:

Pre-training general language representations has a long history. This section briefly reviews the most widely used approaches.

2.1 Unsupervised feature-based approaches:

For decades, learning broadly applicable word representations has been an active research area, including both non-neural and neural methods. Pre-trained word embeddings are an integral part of modern NLP systems, offering significant improvements over embeddings learned from scratch (Turian et al., 2010). To pre-train word embedding vectors, left-to-right language modeling objectives have been used (Mnih and Hinton, 2009), as well as objectives that discriminate correct from incorrect words in left and right contexts (Mikolov et al., 2013).

These approaches have been generalized to coarser granularities, such as sentence embeddings (Kiros et al., 2015; Logeswaran and Lee, 2018) or paragraph embeddings (Le and Mikolov, 2014). To train sentence representations, prior work has used objectives that rank candidate next sentences (Jernite et al., 2017; Logeswaran and Lee, 2018), that generate the words of the next sentence left-to-right given a representation of the previous sentence (Kiros et al., 2015), or that derive from denoising autoencoders (Hill et al., 2016).

ELMo and its predecessor (Peters et al., 2017, 2018a) generalize traditional word embedding research along a different dimension. They extract context-sensitive features from left-to-right and right-to-left language models; the contextual representation of each token is the concatenation of its left-to-right and right-to-left representations. When contextual word embeddings are integrated into existing task-specific architectures, ELMo advances the state of the art on several major NLP benchmarks (Peters et al., 2018a), including question answering (Rajpurkar et al., 2016) and sentiment analysis (Socher et al., 2013). Melamud et al. (2016) proposed learning contextual representations through a task that predicts a single word from both left and right context using LSTMs. Similar to ELMo, their model is feature-based and not deeply bidirectional. Fedus et al. (2018) show that the cloze task can be used to improve the robustness of text generation models.

2.2 Unsupervised fine-tuning approaches:

As with the feature-based approaches, the first work in this direction only pre-trained word embedding parameters from unlabeled text. More recently, sentence or document encoders that produce contextual token representations have been pre-trained on unlabeled text and fine-tuned on supervised downstream tasks.

The advantage of these approaches is that few parameters need to be learned from scratch. At least partly because of this advantage, OpenAI GPT achieved previously state-of-the-art results on many sentence-level tasks from the GLUE benchmark. Left-to-right language modeling and autoencoder objectives have been used to pre-train such models.

Note (Figure 1): BERT's overall pre-training and fine-tuning procedure. Apart from the output layers, the same architecture is used in both pre-training and fine-tuning. The same pre-trained model parameters are used to initialize models for different downstream tasks. During fine-tuning, all parameters are fine-tuned.

2.3 Transfer learning from supervised data:

Work has also shown effective transfer from supervised tasks with large datasets, such as natural language inference and machine translation. Computer vision research has likewise demonstrated the importance of transfer learning from large pre-trained models, where an effective recipe is to fine-tune models pre-trained on ImageNet.

This section will introduce BERT and its detailed implementation. There are two steps in our framework: pre-training and fine-tuning.

A notable feature of BERT is the unified architecture across different tasks. The difference between the pre-training architecture and the final downstream architecture is very small.

BERT's model architecture is a multi-layer bidirectional Transformer encoder, nearly identical in implementation to the original Transformer encoder.

Definitions: the number of Transformer blocks is L; the hidden size is H; the number of self-attention heads is A. The authors mainly present two BERT models:

In this work, we denote the number of layers (i.e., Transformer blocks) as L, the hidden size as H, and the number of self-attention heads as A. We primarily report results on two model sizes: BERT-base (L=12, H=768, A=12, 110M total parameters) and BERT-large (L=24, H=1024, A=16, 340M total parameters).

For comparison, BERT-base is chosen to have the same model size as OpenAI GPT. The key difference is that the BERT Transformer uses bidirectional self-attention, while the GPT Transformer uses constrained self-attention in which each token can only attend to the context to its left.
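To make this contrast concrete, here is a minimal NumPy sketch (not from the paper) of the two attention patterns: a bidirectional mask in which every position may attend to every other position, versus a GPT-style constrained mask in which each position attends only to positions on its left.

```python
import numpy as np

seq_len = 5  # illustrative sequence length

# Bidirectional self-attention (BERT): every position may attend to every position.
bidirectional_mask = np.ones((seq_len, seq_len), dtype=int)

# Constrained self-attention (GPT-style): position i may attend only to positions <= i.
left_to_right_mask = np.tril(np.ones((seq_len, seq_len), dtype=int))

print(bidirectional_mask)
print(left_to_right_mask)
```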

To enable BERT to handle a wide variety of downstream tasks, the input representation is designed to unambiguously represent either a single sentence or a sentence pair, both modeled as one token sequence. The authors use WordPiece embeddings with a 30,000-token vocabulary.
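A minimal sketch of how a single sentence or a sentence pair can be packed into one token sequence with [CLS]/[SEP] markers and A/B segment ids; plain whitespace tokenization stands in for WordPiece here, and the function name is illustrative.

```python
def pack_inputs(sentence_a, sentence_b=None):
    """Pack one sentence or a sentence pair into a single token sequence.

    Whitespace tokenization is a stand-in for the 30,000-token WordPiece vocabulary.
    """
    tokens = ["[CLS]"] + sentence_a.split() + ["[SEP]"]
    segment_ids = [0] * len(tokens)          # segment A
    if sentence_b is not None:
        b_tokens = sentence_b.split() + ["[SEP]"]
        tokens += b_tokens
        segment_ids += [1] * len(b_tokens)   # segment B
    return tokens, segment_ids

print(pack_inputs("the man went to the store", "he bought a gallon of milk"))
```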

3.1 Pre-training BERT:

BERT is not pre-trained with traditional left-to-right or right-to-left language models. Instead, it is pre-trained with the two unsupervised tasks described in this section. This step is shown in the left half of Figure 1.

Task #1: Masked LM

Standard language models can only be trained left-to-right or right-to-left, not truly bidirectionally, because bidirectional conditioning would allow each word to indirectly "see itself", letting the model trivially predict the target word in a multi-layer context.

To pre-train a deep bidirectional representation, the authors randomly mask a fraction of the input tokens and then predict those masked tokens. In this setup, the final hidden vectors corresponding to the masked tokens are fed into an output softmax over the vocabulary, just as in a standard language model. The authors call this procedure a "masked LM" (MLM), often referred to in the literature as a cloze task.

○ Drawbacks of the masked LM pre-training task:

Because the [MASK] token never appears during fine-tuning, there is a mismatch between pre-training and fine-tuning. To mitigate this, the authors propose a compromise:

○ BERT's masking strategy (15% of tokens are selected; of these, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged), as sketched below:
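A minimal sketch of this masking rule using Python's random module; the exact sampling code is not from the paper, only the 15% / 80-10-10 proportions are.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Apply the BERT-style masking rule and return (masked_tokens, target_positions)."""
    masked, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if random.random() >= mask_prob:
            continue
        targets[i] = tok                      # the model must predict the original token
        r = random.random()
        if r < 0.8:
            masked[i] = "[MASK]"              # 80%: replace with [MASK]
        elif r < 0.9:
            masked[i] = random.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, targets

print(mask_tokens("my dog is hairy".split(), vocab=["cat", "apple", "runs"]))
```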

Task #2: Next Sentence Prediction (NSP)

Many downstream tasks depend on understanding the relationship between two sentences, which language modeling does not capture directly. To train a model that understands this inter-sentence relationship, the authors design a binarized next sentence prediction task. Specifically, when sentence pairs are chosen as training examples, 50% of the time the second sentence is the actual next sentence and 50% of the time it is a randomly selected sentence. The final hidden state C of the [CLS] token is fed into a classification layer to make this prediction.
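A minimal sketch of how NSP training pairs can be constructed from a document-level corpus: 50% of the time sentence B is the true next sentence (IsNext), otherwise it is drawn from another document (NotNext). Function and variable names are illustrative, not from the paper's code.

```python
import random

def make_nsp_example(documents):
    """Build one (sentence_a, sentence_b, label) training example for NSP."""
    doc = random.choice([d for d in documents if len(d) >= 2])
    i = random.randrange(len(doc) - 1)
    sentence_a = doc[i]
    if random.random() < 0.5:
        sentence_b, label = doc[i + 1], "IsNext"              # 50%: actual next sentence
    else:
        other = random.choice([d for d in documents if d is not doc])
        sentence_b, label = random.choice(other), "NotNext"   # 50%: random sentence
    return sentence_a, sentence_b, label

docs = [["harry went to the store.", "he bought a gallon of milk."],
        ["the weather was cold.", "penguins are flightless birds."]]
print(make_nsp_example(docs))
```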

○ Pre-training data:

The authors use BooksCorpus (800M words) and English Wikipedia (2,500M words) as pre-training corpora; for Wikipedia, only text passages are extracted, ignoring tables, headers, and the like. To obtain long contiguous text sequences, they use document-level corpora rather than shuffled sentence-level corpora such as the Billion Word Benchmark.

3.2 Fine-tuning BERT:

Because the self-attention mechanism in the Transformer is well suited to many downstream tasks, the model can be fine-tuned directly. For tasks involving text pairs, a common pattern is to encode the two texts independently and then apply bidirectional cross-attention between them. BERT unifies these two stages with self-attention: encoding a concatenated sentence pair with self-attention effectively includes bidirectional cross-attention between the two sentences.

For each task, simply plug the task-specific inputs and outputs into BERT and fine-tune all parameters end-to-end, as sketched below.
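A minimal PyTorch-style sketch of what "plugging in task-specific inputs and outputs" looks like for sentence-level classification: a single linear layer over the final [CLS] hidden state, with the encoder and the head both updated during fine-tuning. The `encoder` here is a stand-in for a pre-trained BERT encoder, not the authors' code.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Task-specific output layer for sentence-level classification."""
    def __init__(self, encoder, hidden_size=768, num_labels=2):
        super().__init__()
        self.encoder = encoder                      # pre-trained BERT encoder (stand-in)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, segment_ids):
        hidden_states = self.encoder(input_ids, segment_ids)  # (batch, seq_len, hidden)
        cls_vector = hidden_states[:, 0]            # final hidden state of [CLS]
        return self.classifier(cls_vector)          # task-specific logits

# During fine-tuning, the optimizer updates *all* parameters end-to-end, e.g.:
# optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
```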

Compared to pre-training, fine-tuning is relatively inexpensive. Starting from the same pre-trained model, all of the results in the paper can be reproduced in at most 1 hour on a single Cloud TPU, or a few hours on a GPU.

This section presents BERT fine-tuning results on 11 NLP tasks.

4.1 GLUE:

GLUE is a collection of diverse natural language understanding tasks. The authors use a batch size of 32, fine-tune for 3 epochs, and select the best learning rate among {5e-5, 4e-5, 3e-5, 2e-5} on the dev set. The results are as follows:

See Table 1 for the results. Both BERT-base and BERT-large outperform all prior systems on all tasks, obtaining 4.5% and 7.0% average accuracy improvements, respectively, over the prior state of the art. Note that BERT-base and OpenAI GPT are nearly identical in model architecture apart from the attention masking.

On MNLI, the largest and most widely reported GLUE task, BERT obtains a 4.6% absolute accuracy improvement. On the official GLUE leaderboard, BERT-large scores 80.5, compared with OpenAI GPT's 72.8 as of the date of writing. BERT-large significantly outperforms BERT-base across all tasks, especially those with little training data.

4.2 SQuAD v1.1:

The Stanford Question Answering Dataset (SQuAD v1.1) is a collection of 100k crowd-sourced question/answer pairs. Given a question and a Wikipedia passage containing the answer, the task is to predict the answer text span within the passage.

As shown in Figure 1, for the question answering task the input question and paragraph are represented as a single packed sequence, with the question using the A embedding and the paragraph using the B embedding. During fine-tuning, only a start vector S and an end vector E are introduced. The probability that word i is the start of the answer span is computed as the dot product between T_i and S, followed by a softmax over all words in the paragraph: P_i = e^{S·T_i} / Σ_j e^{S·T_j}.

An analogous formula is used for the end of the answer span. The score of a candidate span from position i to position j is defined as S·T_i + E·T_j, and the maximum-scoring span with j ≥ i is used as the prediction (a span-selection sketch follows). The training objective is the sum of the log-likelihoods of the correct start and end positions. The model is fine-tuned for 3 epochs with a learning rate of 5e-5 and a batch size of 32.
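A minimal sketch of the span-selection rule described above: given per-token start and end scores (S·T_i and E·T_j), choose the span (i, j) with j ≥ i that maximizes their sum. The score arrays below are illustrative values, not model outputs.

```python
import numpy as np

def best_span(start_scores, end_scores):
    """Return (i, j), with j >= i, maximizing start_scores[i] + end_scores[j]."""
    best, best_score = (0, 0), -np.inf
    for i in range(len(start_scores)):
        for j in range(i, len(end_scores)):
            score = start_scores[i] + end_scores[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best

start = np.array([0.1, 2.3, 0.5, -1.0])  # illustrative S . T_i values per paragraph token
end = np.array([0.0, 0.2, 1.9, 0.3])     # illustrative E . T_j values per paragraph token
print(best_span(start, end))             # -> (1, 2)
```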

Table 2 shows the top leaderboard entries as well as results from the top published systems. The top leaderboard entries do not have up-to-date public system descriptions, and any public data is allowed when training those systems. The authors therefore use modest data augmentation, first fine-tuning on TriviaQA and then fine-tuning on SQuAD.

The best-performing system outperforms the top leaderboard system by +1.5 F1 as an ensemble and +1.3 F1 as a single system. In fact, the single BERT model outperforms the top ensemble system in terms of F1. Without the TriviaQA fine-tuning data, only 0.1-0.4 F1 is lost, still outperforming all existing systems by a wide margin.

Other experiments: omitted here.

In this section, ablation experiments are performed over a number of facets of BERT in order to better understand their relative importance. See Appendix C for additional ablation studies.

5.1 Effect of pre-training tasks:

The following ablation tests were performed:

○ The results are as follows:

5.2 Effect of model size:

○ The results are as follows:

The authors show that, provided the model has been sufficiently pre-trained, scaling to very large model sizes also leads to large improvements on downstream tasks with very little training data.

5.3 Using BERT in a feature-based approach:

○ In the feature-based approach, fixed features are extracted from the pre-trained model without fine-tuning on the specific task.

○ This method also has certain advantages:

The authors run the following experiment on the CoNLL-2003 NER dataset: activations are extracted from one or more layers without fine-tuning BERT, fed into a randomly initialized two-layer, 768-dimensional BiLSTM, and then passed to a classification layer. The results are as follows:

The results show that BERT is effective for both the fine-tuning and feature-based approaches (a minimal sketch of the feature-based setup follows).
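A minimal PyTorch sketch of the feature-based setup described above: contextual activations from frozen BERT layers are fed into a randomly initialized two-layer, 768-dimensional BiLSTM, followed by a token-level classification layer. The per-direction hidden size, label count, and class name are assumptions for illustration.

```python
import torch
import torch.nn as nn

class FeatureBasedNERTagger(nn.Module):
    """Two-layer 768-dim BiLSTM + classifier over frozen BERT features."""
    def __init__(self, feature_dim=768, num_labels=9):
        super().__init__()
        # bidirectional with hidden_size=384 per direction -> 768-dim outputs (assumed split)
        self.bilstm = nn.LSTM(feature_dim, 384, num_layers=2,
                              bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(768, num_labels)

    def forward(self, bert_features):
        # bert_features: (batch, seq_len, feature_dim), extracted without fine-tuning BERT
        lstm_out, _ = self.bilstm(bert_features)
        return self.classifier(lstm_out)            # per-token NER logits

features = torch.randn(2, 16, 768)                  # stand-in for frozen BERT activations
print(FeatureBasedNERTagger()(features).shape)      # torch.Size([2, 16, 9])
```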

Personally, I think BERT's significance lies in:

Recent empirical improvements due to transfer learning with language models have shown that rich, unsupervised pre-training is an integral part of many language understanding systems. In particular, these results enable even low-resource tasks to benefit from deep unidirectional architectures. The paper's major contribution is to further generalize these findings to deep bidirectional architectures, allowing the same pre-trained model to successfully tackle a broad range of NLP tasks.