The Transformer consists of several modules: attention (including multi-head self-attention and encoder-decoder (context) attention), normalization (layer normalization, which differs from batch normalization), masking (padding mask and sequence mask), positional encoding, and a feed-forward network (FFN).
The overall structure of the transformer is shown in the following figure:
This is the typical Transformer structure. Simply put, Transformer = input preprocessing (embedding) + encoder × N + output preprocessing (embedding) + decoder × N + output layer.
The running steps of the model are as follows:
(1) Embed the input, for example with Word2Vec or similar tools; the embedding dimension is 512. The embedding is then combined with the positional encoding to record the position of each input word.
(2) After preprocessing, the input vectors pass through a multi-head attention layer, followed by a residual connection and layer normalization; the result is fed to the FFN (fully connected layer), again followed by a residual connection and layer normalization (a minimal sketch of this "Add & Norm" pattern appears after this list). After six such encoder layers (i.e., N = 6), the encoding part is done.
(3) In the decoding part, the bottom decoder receives the output embedding, and every decoder layer also receives information from the encoder as well as from the decoder layer below it. The final output is generated token by token, and each generated token is fed back into the input of the bottom decoder.
(4) There are six decoder layers, and the decoder output passes through a linear layer and a softmax to produce the final output.
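The following is a minimal numpy sketch of the "add residual and layer-normalize" pattern from step (2). The function names (layer_norm, add_and_norm) and the toy dimensions are illustrative, not from the original text; d_model is shrunk from 512 to 8 for readability, and the FFN weights are random stand-ins.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token vector (last axis) to zero mean and unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def add_and_norm(x, sublayer):
    # Residual connection followed by layer normalization: LayerNorm(x + Sublayer(x)).
    return layer_norm(x + sublayer(x))

# Toy example: batch of 2 sequences, 4 tokens each, d_model = 8 (512 in the paper).
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4, 8))
W1, W2 = rng.standard_normal((8, 32)), rng.standard_normal((32, 8))
ffn = lambda h: np.maximum(0, h @ W1) @ W2   # stand-in for the FFN sublayer
print(add_and_norm(x, ffn).shape)            # (2, 4, 8)
```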
It should be noted that the encoder layers (and likewise the decoder layers) all have the same structure but do not share weights. In the encoder, the word positions depend on each other in the attention sublayer, but there are no such dependencies in the FFN layer, so the FFN can be applied to all positions in parallel.
This structure uses several kinds of attention: self-attention, context (encoder-decoder) attention, scaled dot-product attention, and multi-head attention. Note that scaled dot-product attention and multi-head attention describe how the attention is computed; they are introduced below.
The attention is computed as: Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V, where d_k is the dimension of the key vectors.
Take the first encoder as an example to illustrate the following process:
① For each word entering the encoder, create three vectors: a query vector, a key vector, and a value vector. They are obtained by multiplying the input embedding by three weight matrices. Note that the embedding dimension is 512, while the Q, K, and V vectors have dimension 64.
② Compute the scores: for each word, take the dot product of its query vector with the key vectors of all words.
③ Compute the attention: following the formula above, divide the scores by a fixed value √d_k (this is the "scaling" step), apply a softmax so that the scores sum to 1, and finally multiply the scores by the value vectors at the corresponding positions and sum them to obtain this word's attention output (see the sketch below).
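A minimal numpy sketch of steps ①-③ (scaled dot-product attention). The weight matrices are random stand-ins, and the dimensions are reduced (embedding 8, d_k 4) purely for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # step ②: dot products, then scaling
    weights = softmax(scores)              # step ③: softmax so each row sums to 1
    return weights @ V                     # weighted sum of the value vectors

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))            # 5 words, toy embedding dim 8 (512 in the paper)
Wq, Wk, Wv = (rng.standard_normal((8, 4)) for _ in range(3))   # step ①: d_k = 4 (64 in the paper)
Q, K, V = x @ Wq, x @ Wk, x @ Wv
print(scaled_dot_product_attention(Q, K, V).shape)  # (5, 4)
```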
This is how scaled dot-product attention is computed. Both self-attention and context attention in the Transformer are computed this way; they differ only in where Q, K, and V come from.
Note: why divide by √d_k before the softmax? The reason given in the paper is that the components of q and k are random variables with mean 0 and variance 1. Assuming they are independent, their dot product has mean 0 and variance d_k. Dividing by √d_k keeps the values entering the softmax at mean 0 and variance 1, which is good for gradient computation. Without the scaling, training converges slowly, because the softmax inputs fall into the saturated region where gradients vanish.
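A quick numerical check of this argument (a sketch, not from the original text): with the components of q and k drawn independently from a standard normal and d_k = 64, the dot products have variance close to d_k, and dividing by √d_k brings it back near 1.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, n = 64, 100_000
q = rng.standard_normal((n, d_k))
k = rng.standard_normal((n, d_k))

dots = (q * k).sum(axis=1)                 # n independent dot products q·k
print(dots.var())                          # ~64, i.e. the variance grows with d_k
print((dots / np.sqrt(d_k)).var())         # ~1 after scaling by sqrt(d_k)
```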
Going further: why do many attention variants have no scaling step? There are two families of attention: multiplicative (dot-product) and additive. Experiments show that although additive attention looks simpler, it is not much faster to compute (the tanh amounts to a full hidden layer). Additive attention does perform better at higher dimensions, but dot-product attention catches up once scaling is added. To speed up computation, the Transformer chooses dot-product attention and adds the scaling when the dimension is large.
Multi-head attention is another technique that mainly improves the performance of the attention layer. Although the self-attention described above incorporates the encodings of other positions, the word at its own position still tends to dominate. Sometimes we need to attend more strongly to other positions, for example to determine which noun a pronoun refers to in machine translation.
Multi-head attention projects the Q, K, and V matrices through h separate linear transformations, computes attention h times, and finally concatenates the h results (and projects them back to the model dimension).
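A minimal numpy sketch of multi-head attention as just described: h projections, h attention computations, then concatenation and a final projection. All matrices are random stand-ins and the sizes are toy-sized assumptions (the paper uses h = 8, d_model = 512, d_k = 64).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention, same as the earlier sketch.
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_attention(x, Wq, Wk, Wv, Wo):
    # Wq, Wk, Wv: (h, d_model, d_k) per-head projections; Wo: (h * d_k, d_model) output projection.
    heads = [attention(x @ wq, x @ wk, x @ wv) for wq, wk, wv in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo   # concatenate the h heads, project back

rng = np.random.default_rng(0)
h, d_model, d_k, seq_len = 2, 8, 4, 5
x = rng.standard_normal((seq_len, d_model))
Wq, Wk, Wv = (rng.standard_normal((h, d_model, d_k)) for _ in range(3))
Wo = rng.standard_normal((h * d_k, d_model))
print(multi_head_attention(x, Wq, Wk, Wv, Wo).shape)   # (5, 8)
```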
In the encoder's self-attention, Q, K, and V all come from the output of the previous encoder layer; for the first encoder layer, they come from the sum of the input embedding and the positional encoding.
In the decoder's self-attention, Q, K, and V likewise come from the output of the previous decoder layer; for the first decoder layer, they come from the sum of the output embedding and the positional encoding. Note that here we do not want access to later positions, only to what has already been predicted, so a sequence mask is needed (described later).
In encoder-decoder attention (i.e., context attention), Q comes from the previous decoder sublayer, while K and V come from the encoder output.
The Transformer uses LN (layer normalization), not BN (batch normalization). What is normalization in general? It can be expressed with the following formulas: first compute the statistics μ = (1/n) Σᵢ xᵢ and σ² = (1/n) Σᵢ (xᵢ − μ)², then produce the normalized values x̂ᵢ = (xᵢ − μ) / √(σ² + ε), usually followed by a learned scale and shift yᵢ = γ x̂ᵢ + β.
Normalization adjusts the data distribution: for example, if the data follow some normal distribution, the normalized data follow a standard normal distribution, which amounts to adjusting the mean and variance. The point of doing this is that the activations fall into the sensitive range of the activation function, gradients become larger and training speeds up, extreme values are suppressed, and training becomes more stable.
The difference between LN and BN is shown in the figure:
LN normalizes each sample on its own, while BN normalizes a batch of samples along the same feature dimension, i.e. across samples. In CNN tasks the batch size is large and running estimates of the mean and variance are kept during training, so BN works well. For sequence problems with variable lengths, keeping such cross-sample statistics for every position is unrealistic; LN has far fewer restrictions and works even with a batch size of 1.
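A small numpy sketch, under the assumption of a (batch, seq_len, features) tensor, showing the only real difference between the two: the axes over which the statistics are computed.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 5, 8))   # (batch, seq_len, features)
eps = 1e-6

# Layer norm: statistics per sample and per position, over the feature axis only.
ln = (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

# Batch norm: statistics per feature, across samples (and positions here), i.e. cross-sample.
bn = (x - x.mean(axis=(0, 1), keepdims=True)) / np.sqrt(x.var(axis=(0, 1), keepdims=True) + eps)

print(ln.mean(axis=-1)[0, 0], bn.mean(axis=(0, 1))[:2])   # both ~0 along their own axes
```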
There are two kinds of masks: the padding mask and the sequence mask. They appear in different places in the Transformer: the padding mask is used in every scaled dot-product attention, while the sequence mask is used only in the decoder's self-attention.
Because the input sequences in a batch have different lengths, a padding mask is used to align them: short sequences are padded with 0 up to the length of the longest one. The padded positions carry no meaning, so attention should not be paid to them. In practice, instead of adding 0 to the attention scores at those positions, we add -inf (negative infinity), so that after the softmax their probabilities are close to 0.
In implementation, the padding mask is a Boolean tensor in which False marks the padded positions.
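A minimal numpy sketch of the padding mask, assuming the Boolean convention just described (False at padded positions); the scores and sequence lengths are made up for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

lengths = np.array([3, 5])                        # true lengths in a batch padded to 5
valid = np.arange(5)[None, :] < lengths[:, None]  # Boolean mask, False at padded positions
rng = np.random.default_rng(0)
scores = rng.standard_normal((2, 5, 5))           # raw attention scores (query x key)

# Put -inf at the key positions that are padding, so the softmax gives them ~0 probability.
scores = np.where(valid[:, None, :], scores, -np.inf)
weights = softmax(scores)
print(weights[0, 0])                              # last two entries are 0 for the length-3 sample
```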
As mentioned earlier, the sequence mask prevents the decoder from seeing information after the current time step, so everything that comes later must be hidden. Concretely, an upper-triangular matrix is generated whose values above the diagonal are all 1 and whose diagonal and lower triangle are all 0; the positions marked 1 are the ones masked out.
In the decoder's self-attention, the sequence mask and the padding mask act at the same time and are combined (added together) into a single mask (see the sketch below).
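A minimal numpy sketch of the sequence mask and its combination with a padding mask, following the convention above (1 marks positions to hide). The combination shown simply takes the union of the two masked sets; exact implementations vary.

```python
import numpy as np

seq_len = 5
# Sequence mask: 1 strictly above the diagonal (future positions), 0 on and below it.
seq_mask = np.triu(np.ones((seq_len, seq_len), dtype=int), k=1)

# Padding mask in the same convention: 1 at padded key positions (here the last two).
pad_mask = np.zeros((seq_len, seq_len), dtype=int)
pad_mask[:, 3:] = 1

# Combine: a position is masked if either mask marks it.
combined = np.clip(seq_mask + pad_mask, 0, 1)
print(combined)

# Applying it: set the masked score positions to -inf before the softmax.
scores = np.zeros((seq_len, seq_len))
scores = np.where(combined == 1, -np.inf, scores)
```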
An RNN handles word order naturally, but the Transformer removes this dependence on sequential processing. Take machine translation as an example: for the output to be a complete, well-formed sentence, position information must be added to the input, otherwise each output word might be correct on its own yet fail to form a sentence. Positional encoding encodes the position of each input token and adds it to the input embedding.
Positional encoding uses sine and cosine functions: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)).
The sine form is used for even dimensions and the cosine form for odd dimensions. Because of the properties of the sine and cosine functions, this encoding carries both absolute and relative position information.
The relative-position property relies on the trigonometric angle-addition formulas: sin(α + β) = sin α cos β + cos α sin β and cos(α + β) = cos α cos β − sin α sin β, so PE(pos + k) can be expressed as a linear function of PE(pos).
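A minimal numpy sketch of the sinusoidal positional encoding formulas above, with a toy d_model; the resulting matrix would be added to the input embeddings.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]            # positions 0 .. max_len-1
    i = np.arange(0, d_model, 2)[None, :]        # even dimension indices 2i
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # sine on even dimensions
    pe[:, 1::2] = np.cos(angles)                 # cosine on odd dimensions
    return pe

pe = positional_encoding(max_len=10, d_model=8)  # d_model = 512 in the paper
print(pe.shape)                                  # (10, 8)
# embeddings = embeddings + pe[:seq_len]         # added to the input embedding
```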
The FFN is a fully connected network that applies a linear transformation, a ReLU nonlinearity, and another linear transformation in turn: FFN(x) = max(0, xW₁ + b₁)W₂ + b₂.
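A minimal numpy sketch of this position-wise FFN, with random weights and toy dimensions (the paper uses d_model = 512 and an inner dimension of 2048).

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x @ W1 + b1) @ W2 + b2, applied to each position independently.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 8, 32, 5                 # toy sizes (512 and 2048 in the paper)
x = rng.standard_normal((seq_len, d_model))
W1, b1 = rng.standard_normal((d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)), np.zeros(d_model)
print(ffn(x, W1, b1, W2, b2).shape)               # (5, 8)
```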