Swin Transformer
At present, the challenges in transferring Transformers from language to visual tasks stem mainly from differences between the two domains: visual entities vary greatly in scale, and images have a much higher pixel resolution than words in text, which makes global self-attention prohibitively expensive.

To address these two points, we propose a hierarchical Transformer whose representation is computed with shifted windows, reducing the self-attention computation to linear complexity with respect to image size.

To recap, the main problems in transferring Transformers from the language domain to the visual domain can be summarized as these two: the large variation in the scale of visual entities, and the quadratic cost of global self-attention on high-resolution images.

In the source-code implementation, the patch partition and linear embedding modules are combined into a single module, PatchEmbed. Given an RGB input image of size H x W x 3, each 4x4x3 block of pixels is treated as a patch, and a linear embedding layer projects the patch to an arbitrary dimension C. In the source code this is implemented as a convolution with a 4x4 kernel and stride 4, so the feature map goes from H x W x 3 to H/4 x W/4 x C.
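A minimal sketch of this combined module, written in PyTorch in the style of the official code (the class and argument names here are illustrative):

import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    # Split the image into 4x4 patches and linearly embed them.
    # A single Conv2d with kernel_size = stride = 4 is equivalent to a
    # per-patch linear projection.
    def __init__(self, patch_size=4, in_chans=3, embed_dim=96):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        # x: (B, 3, H, W) -> (B, C, H/4, W/4) -> (B, H/4 * W/4, C)
        x = self.proj(x)
        return x.flatten(2).transpose(1, 2)

For a 224x224 input this produces a sequence of 56x56 = 3136 patch tokens of dimension 96.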

The window-based self-attention described below is the core module of this paper.

Window partition comes in two forms, regular window partition and shifted window partition, which correspond to W-MSA and SW-MSA respectively. Window partition reshapes the input feature map from (B, H, W, C) into (num_windows*B, window_size, window_size, C), where num_windows = H*W / (window_size*window_size); this is then reshaped to (num_windows*B, window_size*window_size, C) before attention is applied, as sketched below.
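The partition, roughly following the official PyTorch implementation, looks like this:

import torch

def window_partition(x, window_size):
    # (B, H, W, C) -> (num_windows*B, window_size, window_size, C)
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size,
               W // window_size, window_size, C)
    windows = (x.permute(0, 1, 3, 2, 4, 5).contiguous()
                .view(-1, window_size, window_size, C))
    return windows

After attention has been computed inside each window, a matching window_reverse operation restores the (B, H, W, C) layout.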

W-MSA consists of a regular window partition module followed by multi-head self-attention computed within each window.

Compared with applying MSA directly, W-MSA mainly reduces the amount of computation. The standard Transformer computes attention globally, which makes the computational complexity very high, whereas the Swin Transformer reduces the cost by computing attention only within each window. The complexity comparison is as follows:

Assuming each window contains M x M patches and the input feature map contains h x w patches with channel dimension C, the computational complexities of global MSA and window-based W-MSA are:

Ω(MSA) = 4hwC^2 + 2(hw)^2 C
Ω(W-MSA) = 4hwC^2 + 2M^2 hwC

The former is quadratic in the number of patches hw, while the latter is linear in hw when the window size M is fixed.
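As a rough sanity check, the two formulas can be evaluated for one illustrative setting (the numbers below assume the stage-1 resolution of Swin-T, h = w = 56, C = 96, M = 7; they are not taken from the text above):

h = w = 56     # stage-1 feature-map size for a 224x224 input (illustrative)
C, M = 96, 7   # channel dimension and window size

msa   = 4 * h * w * C**2 + 2 * (h * w)**2 * C
w_msa = 4 * h * w * C**2 + 2 * M**2 * h * w * C
print(f"MSA:   {msa:.2e}")    # ~2.0e+09
print(f"W-MSA: {w_msa:.2e}")  # ~1.5e+08, roughly 14x cheaper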

Although the amount of computation is reduced, restricting attention to non-overlapping windows leaves the windows with no connections between them, which limits the modeling power. Therefore, the shifted-window module SW-MSA is proposed: a cyclic-shift window partition is added before the MSA.
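The cyclic shift itself is just a roll of the feature map; a sketch along the lines of the official code (the attention mask that handles the wrapped-around regions is omitted here):

import torch

def cyclic_shift(x, shift_size):
    # Roll the (B, H, W, C) feature map so that the subsequent regular
    # window partition effectively produces shifted windows (SW-MSA).
    return torch.roll(x, shifts=(-shift_size, -shift_size), dims=(1, 2))

def reverse_cyclic_shift(x, shift_size):
    # Undo the shift after window attention has been computed.
    return torch.roll(x, shifts=(shift_size, shift_size), dims=(1, 2))

In the official implementation shift_size is window_size // 2, i.e. 3 for the default window size of 7.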

In the Swin Transformer, pooling is not used to downsample the feature map. Instead, similar to the Focus layer in YOLOv5, patch merging concatenates each 2x2 group of neighboring patches (H x W x C -> H/2 x W/2 x 4C) and then applies a fully connected layer (4C -> 2C), so that from one stage to the next the height and width of the feature map are halved and the number of channels is doubled.
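A minimal sketch of patch merging, written here on a (B, H, W, C) layout for readability (the official code flattens H and W into one dimension):

import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    # Downsample by concatenating 2x2 neighboring patches, then project 4C -> 2C.
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):
        # x: (B, H, W, C) with H and W even
        x0 = x[:, 0::2, 0::2, :]   # top-left patch of each 2x2 block
        x1 = x[:, 1::2, 0::2, :]   # bottom-left
        x2 = x[:, 0::2, 1::2, :]   # top-right
        x3 = x[:, 1::2, 1::2, :]   # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)   # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))       # (B, H/2, W/2, 2C)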

The base model is named Swin-B, whose model size and computational complexity are similar to those of ViT-B/DeiT-B. We also propose Swin-T, Swin-S and Swin-L, whose model size and computational complexity are about 0.25x, 0.5x and 2x those of Swin-B, respectively. The complexity of Swin-T and Swin-S is similar to that of ResNet-50 and ResNet-101, respectively. The window size is set to M = 7 by default, and C denotes the channel number of the hidden layers in the first stage.
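For reference, the hyper-parameters of the four variants reported in the paper can be written down as a small configuration (C is the first-stage channel dimension, depths are the number of Swin blocks per stage):

SWIN_VARIANTS = {
    "Swin-T": {"C": 96,  "depths": (2, 2, 6, 2)},
    "Swin-S": {"C": 96,  "depths": (2, 2, 18, 2)},
    "Swin-B": {"C": 128, "depths": (2, 2, 18, 2)},
    "Swin-L": {"C": 192, "depths": (2, 2, 18, 2)},
}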