Transformer for large-scale image recognition
https://arxiv.org/pdf/2010.11929.pdf
This work "Visual Transformer" is based on the transformer model that shines brilliantly in NLP field, and deals with the tasks in the visual field. In a simple way, the author transforms the two-dimensional image data into a form similar to the sentence sequence processed in Transformer, and then uses the Transformer encoder to extract features.
The Transformer paper is called Attention Is All You Need. Nowadays, when attention comes up in deep learning, the Transformer's self-attention mechanism is usually what comes to mind. In fact, the attention mechanism was originally applied to recurrent neural networks, and self-attention can be regarded as a more general version of it. Attention was initially a function of the intermediate hidden states in an encoder-decoder framework; self-attention, on the other hand, does not care about hidden states, only about the dependencies between the vectors of the input sequence. The Transformer paper gives a very concise formula for it.
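The formula from the paper is

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

where Q, K, V are the query, key and value matrices and d_k is the dimension of the key vectors.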
When you see the softmax, you know a probability distribution is being computed. V stands for the values, and the product of Q and K works like a dictionary lookup: queries are matched against keys. It is still abstract, though; to really understand it you have to break the matrices down into individual vectors. Here is a blog post that illustrates the Transformer very well: https://jalammar.github.io/illustrated-transformer/
My understanding is that each original vector is projected three times. When the attention result is computed, one of the projections (the value) only carries the token's own content, while the other two (the query and the key) are matched against the projections of the other vectors in the sequence to measure how strongly the current vector is correlated with each of them.
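A minimal NumPy sketch of single-head self-attention, just to make the three projections concrete (the matrices Wq, Wk, Wv and the sizes below are made up for illustration):

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # X: (seq_len, d_model), one row per token
    Q = X @ Wq                             # queries: what each token is looking for
    K = X @ Wk                             # keys: what each token offers to be matched against
    V = X @ Wv                             # values: the content that actually gets mixed
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # pairwise correlations between tokens
    weights = softmax(scores, axis=-1)     # each row is a probability distribution
    return weights @ V                     # weighted sum of the values

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)        # shape (4, 8)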
The main reason convolution dominates in vision is the local receptive field: the form of convolution is very well suited to image data. However, the receptive field of a single convolution is limited, and a large receptive field can only be obtained through many layers of abstraction. Self-attention, as I understand it, can be seen as selective weighting over the entire input. Running this process several times in parallel gives the multi-head self-attention mechanism.
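Building on the sketch above, a rough illustration of what the multiple heads mean: the model dimension is split into num_heads independent attention computations whose outputs are concatenated again (the shapes and the output projection Wo are my own illustrative choices, not the paper's exact configuration):

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, num_heads):
    # X: (seq_len, d_model); Wq, Wk, Wv, Wo: (d_model, d_model)
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    Q = (X @ Wq).reshape(seq_len, num_heads, d_head)
    K = (X @ Wk).reshape(seq_len, num_heads, d_head)
    V = (X @ Wv).reshape(seq_len, num_heads, d_head)
    heads = []
    for h in range(num_heads):                        # each head attends independently
        scores = Q[:, h] @ K[:, h].T / np.sqrt(d_head)
        heads.append(softmax(scores, axis=-1) @ V[:, h])
    return np.concatenate(heads, axis=-1) @ Wo        # merge heads back to d_model

Each head can weight the sequence differently, and the concatenation keeps the output dimension equal to d_model.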
Through the processing above, the input image has now been transformed into the kind of token sequence a Transformer handles, and the features for each patch of the image are obtained directly through repeated multi-head attention blocks. This is equivalent to replacing the convolution layers for feature extraction, and the output of the last encoder layer is z_L.
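A minimal PyTorch-style sketch of the patch-to-sequence step, assuming a 224x224 input, 16x16 patches, and a strided convolution to flatten and project each patch (these choices are illustrative, not necessarily the exact code of the paper):

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, d_model=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # a strided conv splits the image into non-overlapping patches and
        # linearly projects each one to d_model
        self.proj = nn.Conv2d(in_chans, d_model, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, d_model))

    def forward(self, x):                        # x: (B, 3, H, W)
        x = self.proj(x)                         # (B, d_model, H/16, W/16)
        x = x.flatten(2).transpose(1, 2)         # (B, num_patches, d_model)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)           # prepend the [class] token
        return x + self.pos_embed                # add absolute position embeddings

embed = PatchEmbedding()
tokens = embed(torch.randn(2, 3, 224, 224))      # (2, 197, 768)
# this sequence is then fed through a stack of Transformer encoder layers;
# the output of the last layer is z_L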
Without convolution operations, training requires substantially fewer computational resources.
ViT works very well when it is pre-trained on a sufficiently large dataset.
With such pre-training, ViT models outperform state-of-the-art convolutional models of a comparable order.
https://arxiv.org/pdf/2103.14030.pdf
Unlike ViT, which adds an absolute position encoding to the input sequence, Swin Transformer uses a relative position bias, which is added to the attention scores (the query-key products) inside attention. The paper ran this experiment: if the two methods are used at the same time, performance declines.
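In the Swin paper the windowed attention with this bias is written as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d}} + B\right)V$$

where B is the relative position bias, looked up from a small learned table indexed by the relative coordinates between the two positions inside a window.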