Bidirectional models have been used in natural language processing for a long time. These models read text in two directions: left to right and right to left. BERT's innovation lies in learning bidirectional representations with the Transformer, a deep learning component that, unlike recurrent neural networks (RNNs), does not depend on processing the sequence step by step and can therefore process the whole sequence in parallel. This allows larger data sets to be analyzed and model training to be accelerated. The Transformer uses an attention mechanism to gather information about each word's context and encode it in a rich vector that represents that context, processing all words in a sentence in relation to one another simultaneously rather than separately. The model can thereby learn how every other word in a sentence or paragraph contributes to the meaning of a given word.
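The attention step described above can be made concrete with a minimal sketch of scaled dot-product self-attention. The shapes, variable names, and random projection matrices below are illustrative only and are not taken from any particular BERT implementation.

```python
# Minimal sketch of scaled dot-product self-attention: every token's output
# vector is a weighted mix of all tokens' value vectors, computed in parallel.
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model) token embeddings; w_*: (d_model, d_k) projections."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v              # queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])           # each token attends to every token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the whole sequence
    return weights @ v                                # context-enriched representations

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 8
x = rng.normal(size=(seq_len, d_model))
out = self_attention(x,
                     rng.normal(size=(d_model, d_k)),
                     rng.normal(size=(d_model, d_k)),
                     rng.normal(size=(d_model, d_k)))
print(out.shape)  # (5, 8): each token's vector now mixes in information from all others
```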
Previous word embedding techniques such as GloVe and Word2vec operate without context, generating a single representation for each word regardless of where it appears. For example, the word "bat" is represented the same way whether it refers to a piece of sports equipment or a nocturnal animal. ELMo introduced deep contextualized representations of each word based on the other words in the sentence, using a bidirectional long short-term memory (LSTM) model. Unlike BERT, however, ELMo considers the left-to-right and right-to-left passes separately rather than as a single, unified view of the entire context.
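As a hedged illustration of this contrast, the snippet below extracts BERT's vector for the word "bat" in two different sentences; a static embedding table would return one fixed vector, while BERT returns context-dependent ones. It assumes the Hugging Face transformers package and the public bert-base-uncased checkpoint are available; the helper name and example sentences are placeholders.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bat_vector(sentence):
    # Locate the token id for "bat" in this sentence and return its hidden state.
    inputs = tok(sentence, return_tensors="pt")
    idx = inputs["input_ids"][0].tolist().index(tok.convert_tokens_to_ids("bat"))
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)
    return hidden[0, idx]

v_sport = bat_vector("He swung the bat and hit a home run.")
v_animal = bat_vector("A bat flew out of the dark cave.")
# Same surface word, different vectors: similarity is clearly below 1.0.
print(torch.cosine_similarity(v_sport, v_animal, dim=0))
```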
Because most of BERT's parameters are dedicated to creating high-quality contextual word embeddings, the framework is very well suited to transfer learning. By training BERT with self-supervised tasks such as language modeling (tasks that require no manual labeling), it can make use of large unlabeled data sets such as WikiText and BookCorpus, which contain more than 3.3 billion words. To learn another task, such as question answering, the last layer can be replaced with one suited to that task and fine-tuned.
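A minimal sketch of this transfer-learning step is shown below: load the pretrained encoder, attach a fresh task head (here a two-class sequence classifier), and fine-tune on labeled data. The toy batch, labels, and learning rate are assumptions for illustration, not the original training setup.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
# Pretrained BERT body plus a randomly initialized classification head.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
batch = tok(["great movie", "terrible movie"], return_tensors="pt", padding=True)
labels = torch.tensor([1, 0])

model.train()
outputs = model(**batch, labels=labels)  # cross-entropy loss computed internally
outputs.loss.backward()                  # fine-tunes both the head and the encoder
optimizer.step()
```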
The arrows in the figure below represent the information flow from one layer to the next in three different NLP models.
The BERT model can capture nuances of expression more finely. For example, when processing the sequence "Bob needs some medicine. His stomach is upset. Can you bring him some antacid?", BERT is better able to understand that "Bob", "His", and "him" all refer to the same person. Previously, given a query such as "How to fill Bob's prescription", a model might not understand that the person referenced in the second sentence is Bob. With BERT, the model can understand how all of these references relate to one another.
Bidirectional training is difficult to achieve because, by default, each word would indirectly "see itself": in a multi-layer model conditioned on both the preceding and following words, the word to be predicted leaks into its own context. BERT's developers solved this problem by masking the words to be predicted, along with other random words, in the corpus. BERT also uses a simple additional training task: given two sentences A and B, predict whether B actually follows A or is a random sentence.
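The snippet below is a hedged sketch of these two pre-training objectives using the transformers package and the bert-base-uncased checkpoint: masked language modeling (predict a hidden token from both sides) and next sentence prediction (decide whether sentence B follows sentence A). The example sentences and output handling are assumptions for illustration, not the original authors' training code.

```python
import torch
from transformers import BertForPreTraining, BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForPreTraining.from_pretrained("bert-base-uncased")

# Sentence A with one masked token, paired with a candidate "next" sentence B.
inputs = tok("Bob needs some [MASK].", "His stomach is upset.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# Masked language modeling head: predict the hidden token from its full context.
mask_pos = (inputs["input_ids"][0] == tok.mask_token_id).nonzero().item()
print(tok.decode(out.prediction_logits[0, mask_pos].argmax().item()))

# Next sentence prediction head: index 0 means "B follows A", index 1 means random.
print(out.seq_relationship_logits.softmax(dim=-1))
```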