
In the article "Attention is all you need", the position coding method is basically just passed by, without explaining in detail why the position coding should be added to self-attention and the benefits of the coding method in the article.

The encoding used in that paper is the sinusoidal scheme

$$
p_{k,2i} = \sin\!\left(\frac{k}{10000^{2i/d}}\right), \qquad
p_{k,2i+1} = \cos\!\left(\frac{k}{10000^{2i/d}}\right),
$$

where $p_{k,2i}$ and $p_{k,2i+1}$ are the $2i$-th and $(2i+1)$-th components of the encoding vector for position $k$, and $d$ is the dimension of the vector.
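As a concrete reference, here is a minimal NumPy sketch of that sinusoidal encoding; the function name and the (positions × dimension) array layout are choices made for this illustration, not anything prescribed by the paper.

```python
import numpy as np

def sinusoidal_position_encoding(num_positions: int, d: int) -> np.ndarray:
    """Sinusoidal encoding from "Attention Is All You Need".

    Returns an array of shape (num_positions, d) whose row k is the
    encoding vector p_k, with p_{k,2i} = sin(k / 10000^(2i/d)) and
    p_{k,2i+1} = cos(k / 10000^(2i/d)).
    """
    positions = np.arange(num_positions)[:, None]      # (num_positions, 1)
    freqs = 1.0 / 10000 ** (np.arange(0, d, 2) / d)    # (d/2,) frequencies
    angles = positions * freqs                         # (num_positions, d/2)
    pe = np.zeros((num_positions, d))
    pe[:, 0::2] = np.sin(angles)   # even components
    pe[:, 1::2] = np.cos(angles)   # odd components
    return pe

pe = sinusoidal_position_encoding(num_positions=50, d=16)
print(pe.shape)  # (50, 16)
```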

Questions such as "how was this form derived?" and "does it have to take this form?" are quite basic, yet the paper gives no real answer to them.

The key observation is that the self-attention model is ==symmetric==. Write the model as $f(\cdots, x_m, \cdots, x_n, \cdots)$; a symmetric model is one for which, for any two positions $m$ and $n$,

$$
f(\cdots, x_m, \cdots, x_n, \cdots) = f(\cdots, x_n, \cdots, x_m, \cdots). \tag{1}
$$

This is the symmetry in question, and ==symmetry== is exactly why a pure Transformer cannot identify position information: without any position signal, the function naturally satisfies identity (1), so swapping two input tokens leaves the output unchanged.

What we need to do is break this symmetry so that position information enters the model, for example by adding a different offset vector to each position:

$$
\tilde{f}(\cdots, x_m, \cdots, x_n, \cdots) = f(\cdots, x_m + p_m, \cdots, x_n + p_n, \cdots). \tag{2}
$$

As long as the encoding vector $p_k$ is different for every position $k$, the symmetry is broken and the model can then be used to process ordered inputs, since each token now carries its position information. The sketch below illustrates this.
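To make the symmetry argument concrete, the following sketch uses a toy single-layer dot-product self-attention with identity query/key/value projections (an assumption made only for this illustration) and checks numerically that swapping two input tokens leaves the output at the untouched positions unchanged, while adding distinct offset vectors $p_k$ breaks this.

```python
import numpy as np

rng = np.random.default_rng(0)

def self_attention(x: np.ndarray) -> np.ndarray:
    """Unmasked dot-product self-attention with identity Q/K/V projections:
    softmax(x x^T / sqrt(d)) x.  It contains no position information."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

x = rng.normal(size=(4, 8))      # 4 tokens of dimension 8
x_swapped = x[[0, 2, 1, 3]]      # swap the tokens at positions 1 and 2

# Symmetry: the outputs at the untouched positions 0 and 3 do not change.
print(np.allclose(self_attention(x)[[0, 3]],
                  self_attention(x_swapped)[[0, 3]]))            # True

# Adding a distinct offset p_k to each position breaks the symmetry.
p = rng.normal(size=(4, 8))
print(np.allclose(self_attention(x + p)[[0, 3]],
                  self_attention(x_swapped + p)[[0, 3]]))        # False
```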

For now, treat only the encodings of the two positions $m$ and $n$ as perturbation terms and expand to second order in a Taylor series:

$$
\tilde{f} \approx f
+ p_m^{\top} \frac{\partial f}{\partial x_m}
+ p_n^{\top} \frac{\partial f}{\partial x_n}
+ \frac{1}{2}\, p_m^{\top} \frac{\partial^2 f}{\partial x_m^2}\, p_m
+ \frac{1}{2}\, p_n^{\top} \frac{\partial^2 f}{\partial x_n^2}\, p_n
+ p_m^{\top} \frac{\partial^2 f}{\partial x_m \partial x_n}\, p_n. \tag{3}
$$

From this expansion we can see that the first term carries no position information, while the second through fifth terms each depend on a single position (==absolute position information==). The sixth term involves both $p_m$ and $p_n$ and is an interaction between them; this is the term we hope can represent ==relative position information==.

So formula (3) tells us that the last term of the expansion is the interaction between the ==two positions== and should implicitly contain their relative position. Why can this term express relative position information?

Suppose, for simplicity, that the Hessian $\frac{\partial^2 f}{\partial x_m \partial x_n}$ is the identity matrix. The interaction term then reduces to $p_m^{\top} p_n = \langle p_m, p_n \rangle$, ==the inner product of the two encodings==, and we want this term to express the relative position of the two positions, that is, we want ==there to exist a function== $g$ such that

$$
\langle p_m, p_n \rangle = g(m - n).
$$
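As a sanity check on the sinusoidal case, the short sketch below verifies numerically that the inner product of two such encodings is indeed a function of $m - n$ only; the chosen sizes and positions are arbitrary.

```python
import numpy as np

# Build sinusoidal encodings p_k (same formula as the sketch above).
num_positions, d = 128, 64
k = np.arange(num_positions)[:, None]
freqs = 1.0 / 10000 ** (np.arange(0, d, 2) / d)
pe = np.zeros((num_positions, d))
pe[:, 0::2] = np.sin(k * freqs)
pe[:, 1::2] = np.cos(k * freqs)

# <p_m, p_n> = sum_i [sin(m w_i) sin(n w_i) + cos(m w_i) cos(n w_i)]
#            = sum_i cos((m - n) w_i),
# so the inner product depends only on the offset m - n.
print(np.isclose(pe[10] @ pe[4], pe[57] @ pe[51]))   # True  (both offsets are 6)
print(np.isclose(pe[10] @ pe[4], pe[10] @ pe[5]))    # False (offsets 6 vs 5)
```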

In the Transformer architecture, positional encoding provides valuable supervision for modeling the dependencies between elements at different positions of a sequence. The RoFormer paper studies various positional encoding methods in Transformer-based language models and proposes a new one, Rotary Position Embedding (RoPE). RoPE encodes the absolute position with a rotation matrix, and the relative position dependence then appears naturally inside the self-attention formulation.
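The following is a minimal NumPy sketch of that rotary idea, kept to a single vector for clarity (a real implementation would vectorize over the sequence, heads, and batch). The function name and the interleaved even/odd pairing are choices made for this sketch, while the angles reuse the same $10000^{-2i/d}$ frequencies as the sinusoidal encoding.

```python
import numpy as np

def rotary_embed(x: np.ndarray, position: int, base: float = 10000.0) -> np.ndarray:
    """Rotate each pair (x_{2i}, x_{2i+1}) of a query/key vector by the
    angle position * theta_i, with theta_i = base^(-2i/d).  The rotation
    encodes the absolute position of the vector."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)     # (d/2,) frequencies
    angles = position * theta
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=16), rng.normal(size=16)

# The attention score <R_m q, R_n k> depends only on the offset m - n,
# so absolute-position rotations yield a relative-position dependence.
print(np.isclose(rotary_embed(q, 7) @ rotary_embed(k, 3),
                 rotary_embed(q, 25) @ rotary_embed(k, 21)))   # True (both offsets are 4)
```

Note that, unlike the additive offsets discussed above, the rotation is applied to the query and key vectors inside attention rather than to the token embeddings themselves.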

RoPE has several valuable properties: it can be flexibly extended to sequences of any length, the dependency between tokens decays as their relative distance grows, and it can equip linear self-attention with relative position information.

The experiments in the paper show that the Transformer enhanced with rotary position embedding, called RoFormer, achieves better performance on various language modeling tasks.

The order of words plays an important role in natural language. RNNs encode word order by processing the sequence recurrently along the time dimension in their hidden states. CNNs are usually considered order-agnostic, but recent work shows that the common padding operations allow them to learn position information implicitly.

In recent years, Transformer-based models have proved effective on a wide range of NLP tasks: they offer better parallelization than RNNs and handle long-range relationships between tokens better than CNNs.

Since Transformers use neither recurrence nor convolution, and since the self-attention structure by itself carries no position information, many different methods have been proposed for injecting position information into the model.

The ==contributions== of this work: