Attention-based Neural Machine Translation
In recent years, research on deep learning has become deeper and deeper and has produced breakthroughs in many fields. Neural networks based on the attention mechanism have become one of the hottest topics in neural network research.

The attention mechanism was first proposed in the field of visual images, as early as the 1990s, but it only became truly popular with the paper "Recurrent Models of Visual Attention" from the Google DeepMind team [14], which uses an attention mechanism on top of an RNN model to classify images. Subsequently, Bahdanau et al., in their paper "Neural Machine Translation by Jointly Learning to Align and Translate" [1], used a similar attention mechanism to translate and align simultaneously in machine translation; their work was the first to apply the attention mechanism to the NLP field. Similar attention-based extensions of RNN models were then applied to a variety of NLP tasks. More recently, how to use the attention mechanism in CNNs has also become a hot research topic.

Before introducing attention in NLP, I want to talk about the idea of using attention on images. In the representative paper "Recurrent Models of Visual Attention" [14], the research motivation is actually inspired by the human attention mechanism. When people look at an image, they do not take in every pixel of the whole image at once; instead, they mostly focus on a specific part of the image as needed. Moreover, based on what they have already observed, humans learn where to direct their attention next. The following figure is a schematic diagram of the paper's core model.

Based on a traditional RNN, this model adds an attention mechanism (the part circled in red) and learns, through attention, which part of the image to process. At each step, it processes only the pixels of the attended region, chosen according to the location L learned at the previous step and the current input image, rather than all pixels of the image. The advantage is that far fewer pixels need to be processed, which reduces the complexity of the task. It is clear that this use of attention on images closely mirrors the human attention mechanism. Next, let's look at attention as used in NLP.
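As a toy illustration of this idea (not the model from the paper), the sketch below crops a small glimpse around a given location so that only that patch is processed; the patch size and the variable names are my own assumptions.

```python
import numpy as np

def extract_glimpse(image, loc, size=8):
    """Crop a size x size patch centered at loc = (row, col), clipped to the image."""
    h, w = image.shape
    r = int(np.clip(loc[0], size // 2, h - size // 2))
    c = int(np.clip(loc[1], size // 2, w - size // 2))
    return image[r - size // 2: r + size // 2, c - size // 2: c + size // 2]

# Only the 8x8 glimpse is fed to the recurrent network, not the full 64x64 image.
image = np.random.rand(64, 64)
patch = extract_glimpse(image, loc=(20, 35))
print(patch.shape)  # (8, 8)
```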

The paper by Bahdanau et al. [1] applied the attention mechanism to natural language processing for the first time, specifically to neural machine translation (NMT). NMT is a typical sequence-to-sequence model, i.e., an encoder-decoder model. Traditional NMT uses two RNNs: one RNN encodes the source sentence into a fixed-dimensional intermediate vector, and the other RNN decodes that vector to translate it into the target language. The traditional model looks like this:
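To make the encoder-decoder structure concrete, here is a minimal NumPy sketch of the traditional setup, where the only thing passed from encoder to decoder is a single fixed-dimensional vector. The plain tanh RNN cell and the dimensions are simplifying assumptions, not the architecture of any particular system.

```python
import numpy as np

def rnn_step(x, h, Wx, Wh, b):
    # A plain tanh RNN cell; real NMT systems use GRU/LSTM cells.
    return np.tanh(Wx @ x + Wh @ h + b)

d_emb, d_hid = 16, 32
rng = np.random.default_rng(0)
Wx_e, Wh_e, b_e = rng.normal(size=(d_hid, d_emb)), rng.normal(size=(d_hid, d_hid)), np.zeros(d_hid)
Wx_d, Wh_d, b_d = rng.normal(size=(d_hid, d_emb)), rng.normal(size=(d_hid, d_hid)), np.zeros(d_hid)

# Encoder: compress the whole source sentence into one fixed vector.
source = [rng.normal(size=d_emb) for _ in range(7)]   # 7 source word embeddings
h = np.zeros(d_hid)
for x in source:
    h = rnn_step(x, h, Wx_e, Wh_e, b_e)
context = h  # the single bottleneck vector passed to the decoder

# Decoder: starts from that vector and generates the target sequence.
s = context
prev_word = np.zeros(d_emb)  # embedding of the start-of-sentence token
for _ in range(5):
    s = rnn_step(prev_word, s, Wx_d, Wh_d, b_d)
    # ... project s to the target vocabulary and pick the next word ...
```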

This paper proposes an attention-based NMT model, which looks roughly as follows:

In the figure I did not draw all of the connections in the decoder; only the first two words are shown, and the later words follow the same pattern. As you can see, attention-based NMT builds on the traditional model but connects the representation learned for every word in the source sentence (the traditional model only uses the representation obtained after the last word) with the target word currently being predicted. This connection is realized through the attention mechanism they designed. After the model is trained, the attention matrix gives an alignment matrix between the source language and the target language. The attention design in the paper is as follows:

As you can see, they score every target word against every source word with a small perceptron, and then normalize the scores with a softmax function to obtain a probability distribution, which forms the attention matrix.
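The following is a minimal NumPy sketch of this additive (perceptron-style) scoring followed by a softmax, in the spirit of Bahdanau et al. [1]. The matrix names Wa, Ua, va and the dimensions are my own notation, not taken from the paper's code.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention(s_prev, enc_states, Wa, Ua, va):
    """s_prev: previous decoder state (d,); enc_states: (T, d) encoder states.
    Returns attention weights over the T source positions and the context vector."""
    scores = np.array([va @ np.tanh(Wa @ s_prev + Ua @ h) for h in enc_states])
    weights = softmax(scores)        # one row of the attention matrix
    context = weights @ enc_states   # weighted sum of encoder states
    return weights, context

d, T = 32, 7
rng = np.random.default_rng(0)
Wa, Ua, va = rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=d)
weights, context = additive_attention(rng.normal(size=d), rng.normal(size=(T, d)), Wa, Ua, va)
print(weights.round(3), weights.sum())  # the weights sum to 1.0
```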

In terms of results, compared with traditional NMT (in the paper, RNNsearch is the attention-based NMT and RNNenc is the traditional NMT), the attention model improves performance considerably. Its most notable features are that the alignment can be visualized and that it handles long sentences much better.

The next paper is a very representative one after the previous work. It shows how attention can be extended within RNN models, and it greatly promoted the subsequent application of various attention-based models in natural language processing. In this paper, the authors propose two kinds of attention mechanisms: a global attention mechanism and a local attention mechanism.

First, let's look at global attention. The idea is essentially the same as the attention proposed in the previous paper: it attends over all the words in the source sentence. The difference is that, for computing the values of the attention matrix, the authors propose several simple variants of the scoring function.

In their final experiments, the "general" scoring function works best.
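As a rough sketch (my own NumPy notation, not the authors' code), the commonly cited scoring variants, dot, general and concat, can be written like this, where h_t is the current decoder state and h_s a source-side state:

```python
import numpy as np

def score_dot(h_t, h_s):
    return h_t @ h_s

def score_general(h_t, h_s, Wa):
    # The "general" form, reported as the best in their experiments.
    return h_t @ (Wa @ h_s)

def score_concat(h_t, h_s, Wa, va):
    return va @ np.tanh(Wa @ np.concatenate([h_t, h_s]))

d = 32
rng = np.random.default_rng(0)
h_t, h_s = rng.normal(size=d), rng.normal(size=d)
print(score_dot(h_t, h_s),
      score_general(h_t, h_s, rng.normal(size=(d, d))),
      score_concat(h_t, h_s, rng.normal(size=(d, 2 * d)), rng.normal(size=d)))
```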

Now let's look at their local version. The main idea is to reduce the cost of computing attention: instead of considering all the words in the source sentence, the model uses a prediction function to predict the source position p_t to be aligned with at the current decoding step, and then only attends to the words inside a context window around that position.

Two prediction methods, local-m and local-p, are given. When computing the final attention weights, the scores are additionally multiplied by a Gaussian centered at position p_t. The authors' experimental results show that local attention works better than global attention.
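Here is a minimal sketch of the local-p idea under my own assumptions (the fixed window half-width D, the dot-product scores, and the renormalization are simplifications, not the paper's code): predict p_t from the decoder state, take a window around it, and reweight the scores with a Gaussian centered at p_t.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def local_p_attention(h_t, enc_states, Wp, vp, D=3):
    """h_t: decoder state (d,); enc_states: (S, d). Returns p_t, window weights, context."""
    S, d = enc_states.shape
    # Predict the center position p_t in [0, S) from the decoder state (sigmoid * S).
    p_t = S * float(1 / (1 + np.exp(-(vp @ np.tanh(Wp @ h_t)))))
    lo, hi = max(0, int(p_t) - D), min(S, int(p_t) + D + 1)
    window = enc_states[lo:hi]
    scores = window @ h_t                       # dot-product scores inside the window
    weights = softmax(scores)
    # Favor positions near p_t with a Gaussian of standard deviation D/2.
    positions = np.arange(lo, hi)
    gauss = np.exp(-((positions - p_t) ** 2) / (2 * (D / 2) ** 2))
    weights = weights * gauss
    weights = weights / weights.sum()           # renormalize for clarity
    context = weights @ window
    return p_t, weights, context

rng = np.random.default_rng(0)
p_t, w, c = local_p_attention(rng.normal(size=16), rng.normal(size=(10, 16)),
                              rng.normal(size=(16, 16)), rng.normal(size=16))
print(round(p_t, 2), w.round(3))
```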

I think the greatest contributions of this paper are, first, showing how the computation of attention scores can be extended, and second, the local attention method.

After that, attention-based RNN models began to be widely used in NLP, not only for sequence-to-sequence tasks but also for various classification problems. So can CNNs, convolutional neural networks that are as popular as RNNs in deep learning, also use the attention mechanism? The paper "ABCNN: Attention-Based Convolutional Neural Network for Modeling Sentence Pairs" [13] proposes methods for using attention in CNNs and is an early exploratory work on attention in CNNs.

When a traditional CNN builds a sentence-pair model, it processes each sentence through its own single channel, learns a sentence representation, and finally feeds the representations into a classifier. In such a model there is no interaction between the two sentences before the classifier, so the authors hope to link the sentence pair across the two CNN channels by designing an attention mechanism.

The first method, ABCNN-1, applies attention before convolution: it computes attention feature maps for the sentence pair through an attention matrix and feeds them into the convolution layer together with the original feature maps. The specific computation is as follows.
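A rough NumPy sketch of this step under my own assumptions (a Euclidean-distance match score and learned projection matrices W0, W1, which follow the common description of ABCNN-1 but are not copied from the paper's code):

```python
import numpy as np

def abcnn1_attention_maps(F0, F1, W0, W1):
    """F0: (d, s0), F1: (d, s1) feature maps of the two sentences.
    Returns the attention matrix and the two attention feature maps."""
    s0, s1 = F0.shape[1], F1.shape[1]
    # Match score between column i of F0 and column j of F1.
    A = np.zeros((s0, s1))
    for i in range(s0):
        for j in range(s1):
            A[i, j] = 1.0 / (1.0 + np.linalg.norm(F0[:, i] - F1[:, j]))
    # Attention feature maps, stacked with the original maps before convolution.
    F0_att = W0 @ A.T   # (d, s0)
    F1_att = W1 @ A     # (d, s1)
    return A, F0_att, F1_att

d, s0, s1 = 8, 5, 6
rng = np.random.default_rng(0)
A, F0_att, F1_att = abcnn1_attention_maps(rng.normal(size=(d, s0)), rng.normal(size=(d, s1)),
                                          rng.normal(size=(d, s1)), rng.normal(size=(d, s0)))
print(A.shape, F0_att.shape, F1_att.shape)  # (5, 6) (8, 5) (8, 6)
```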

The second method, ABCNN-2, applies attention to the output of the convolution and uses it to weight the pooling. The principle is as follows.
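A simplified sketch of attention-weighted pooling in this spirit (the window width w, the match score, and the row/column-sum weighting are my assumptions based on the usual description of ABCNN-2, not the paper's code):

```python
import numpy as np

def attention_pooling(C0, C1, w=3):
    """C0: (d, t0), C1: (d, t1) convolution outputs.
    Each unit's pooling weight is its summed attention toward the other sentence."""
    t0, t1 = C0.shape[1], C1.shape[1]
    A = np.zeros((t0, t1))
    for i in range(t0):
        for j in range(t1):
            A[i, j] = 1.0 / (1.0 + np.linalg.norm(C0[:, i] - C1[:, j]))
    a0 = A.sum(axis=1)   # per-column weights for sentence 0
    a1 = A.sum(axis=0)   # per-column weights for sentence 1

    def pool(C, a):
        # Attention-weighted sum over sliding windows of width w.
        cols = [(C[:, k:k + w] * a[k:k + w]).sum(axis=1)
                for k in range(C.shape[1] - w + 1)]
        return np.stack(cols, axis=1)

    return pool(C0, a0), pool(C1, a1)

rng = np.random.default_rng(0)
P0, P1 = attention_pooling(rng.normal(size=(8, 7)), rng.normal(size=(8, 9)))
print(P0.shape, P1.shape)  # (8, 5) (8, 7)
```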

The third method, ABCNN-3, uses the first two methods simultaneously in the CNN, as shown below.

This paper shows us how attention can be used in CNNs. Since then, many works have applied attention to CNNs and achieved good results.

Finally, a brief summary. In my view, attention in NLP can be regarded as a form of automatic weighting that can connect any two modules you want to relate. The mainstream computation currently takes the following form:

A function is designed to connect the target module m_t and the source module m_s, and its output is then normalized with a softmax function to obtain a probability distribution.
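Written out (using m_t and m_s as above, and score(·,·) for the designed connection function; this is a standard formulation, reconstructed here rather than copied from the original figure):

```latex
a_t(s) = \mathrm{softmax}\big(\mathrm{score}(m_t, m_s)\big)
       = \frac{\exp\big(\mathrm{score}(m_t, m_s)\big)}
              {\sum_{s'} \exp\big(\mathrm{score}(m_t, m_{s'})\big)}
```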

Attention is now widely used in natural language processing. One of its great advantages is that the attention matrix can be visualized, showing which parts the neural network attends to while performing a task.

However, the attention mechanism in NLP is still different from human attention. It generally has to compute scores over all the objects to be processed and store their weights in an extra matrix, which actually adds overhead, whereas human attention simply ignores what it does not attend to and only processes what it cares about.