In addition to accuracy, a model's inference speed also matters. To obtain deployment-friendly yet highly accurate models, many recent studies improve performance through structural re-parameterization. A structurally re-parameterized model has different structures at training time and at inference time: during training, a complex structure is used to reach high accuracy; after training, this complex structure is compressed by equivalent transformations into a linear layer that runs quickly at inference. The compressed model usually has a concise architecture, such as a VGG-like or ResNet-like structure. From this point of view, the re-parameterization strategy improves model performance without introducing any extra inference-time overhead. This public account previously posted an article interpreting RepVGG ("RepVGG: VGG, the eternal GOAT! | New in 2021"); take a look if you are interested.
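To make the basic idea concrete, here is a minimal PyTorch sketch (not the paper's code; branch choice, names and shapes are illustrative assumptions) showing that a training-time block made of a 3×3 conv, a 1×1 conv and an identity branch can be fused into a single 3×3 conv at inference time:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

C = 8                                              # in_channels == out_channels so identity is valid
conv3 = nn.Conv2d(C, C, 3, padding=1, bias=True)   # 3x3 branch
conv1 = nn.Conv2d(C, C, 1, bias=True)              # 1x1 branch

x = torch.randn(2, C, 16, 16)
y_train = conv3(x) + conv1(x) + x                  # multi-branch (training-time) output

with torch.no_grad():
    idx = torch.arange(C)
    w = conv3.weight.clone()
    w += F.pad(conv1.weight, [1, 1, 1, 1])         # place the 1x1 kernel at the 3x3 centre
    w[idx, idx, 1, 1] += 1.0                       # identity branch expressed as a 3x3 kernel
    b = conv3.bias + conv1.bias                    # identity contributes no bias
    y_infer = F.conv2d(x, w, b, padding=1)         # single fused convolution

print(torch.allclose(y_train, y_infer, atol=1e-5))  # True
```

The fusion is exact because every branch is linear, so the three kernels can simply be summed after aligning their shapes.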
The BN layer is a key component of re-parameterized models: a BN layer is added after every convolution layer, as shown in Figure 1b, and removing it leads to a serious drop in accuracy. At the inference stage, the complex structure can be compressed into a single convolution layer. At the training stage, however, the BN layer is non-linear (it divides the feature map by its standard deviation), so each branch has to be computed separately. This produces a large number of intermediate operations (high FLOPs) and buffered feature maps (high memory usage), bringing a huge computational overhead. Worse still, the high training cost hinders the exploration of more complex and potentially more powerful re-parameterized structures.
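The reason the compression works at inference but not during training can be seen from the standard conv-BN folding rule. Below is a minimal sketch (an assumed helper, not from the paper): in eval mode BN uses fixed running statistics and is therefore just an affine per-channel map, which folds into the preceding conv; in training mode the statistics depend on the current batch, so no fixed merged kernel exists.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fold_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d):
    """Fold an eval-mode BN into the preceding conv; returns (weight, bias)."""
    std = torch.sqrt(bn.running_var + bn.eps)
    scale = bn.weight / std                                   # gamma / std, per output channel
    w = conv.weight * scale.reshape(-1, 1, 1, 1)              # scale each output filter
    bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    b = scale * (bias - bn.running_mean) + bn.bias
    return w, b

conv = nn.Conv2d(4, 8, 3, padding=1, bias=False)
bn = nn.BatchNorm2d(8)
_ = bn(conv(torch.randn(8, 4, 16, 16)))                       # one pass to populate running stats
bn.eval()                                                     # from now on BN is a fixed affine map

x = torch.randn(1, 4, 16, 16)
w, b = fold_conv_bn(conv, bn)
print(torch.allclose(bn(conv(x)), F.conv2d(x, w, b, padding=1), atol=1e-5))  # True
```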
Why is the BN layer so important for re-parameterization? Through experiments and analysis, the authors find that the scale factor of the BN layer diversifies the optimization directions of the different branches. Based on this finding, the paper proposes an online re-parameterization method, OREPA, as shown in Figure 1c, which consists of two steps: block linearization, which replaces the non-linear BN layers with linear channel-wise scaling layers, and block squeezing, which squeezes the complex structure into a single convolution layer during training.
OREPA removes the computation and storage overhead brought by the intermediate layers, which significantly reduces training cost (65%-75% memory saving and a 1.5-2.3x speed-up) with little impact on performance, making it feasible to explore more complex re-parameterized structures. To verify this point, the paper further proposes several re-parameterized components to obtain better performance.
The contributions of this paper include the following three points:
1. OREPA, an online re-parameterization method that squeezes the complex training-time structure into a single convolution layer, greatly reducing training memory and time.
2. An analysis of the role of the BN layer in re-parameterization, and a linear channel-wise scaling layer that replaces it while keeping the diversity of optimization directions.
3. OREPA-ResNet, which adds several new re-parameterized components, together with experiments that verify both accuracy and training efficiency.
OREPA simplifies the complex training-time structure into a single convolution layer while keeping the accuracy unchanged. The transformation process of OREPA is shown in Figure 2 and includes two steps: block linearization and block squeezing.
The BN layer is the key structure in the multi-layer, multi-branch re-parameterized blocks and is the basis of their performance. Taking DBB and RepVGG as examples, removing the per-branch BN layers (replacing them with a single BN shared after the branches are summed) leads to an obvious performance drop, as shown in Table 1.
However, the BN layers bring a high training cost. At the inference stage, all intermediate operations in the re-parameterized structure are linear and can be merged. At the training stage, the BN layer is non-linear (it divides the feature map by its standard deviation), so the merge cannot be performed. Since merging fails, the intermediate operations have to be computed separately, resulting in a huge computation and memory overhead. In addition, the high cost hinders the exploration of more complex structures.
Although the BN layer prevents merged computation during training, it cannot simply be deleted because of the accuracy drop. To solve this problem, a channel-wise linear scaling layer is introduced as a linear replacement for BN: the feature map is scaled by a learnable per-channel vector. The linear scaling layer has a similar effect to the BN layer, guiding the branches to optimize in different directions, which is the core of the performance gain of re-parameterization.
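A minimal sketch of such a channel-wise linear scaling layer is shown below (the class name is an assumption, not the paper's code). Unlike BN, it uses no batch statistics and is linear in its input, so it can be merged with neighbouring convolutions even during training.

```python
import torch
import torch.nn as nn

class ChannelScaling(nn.Module):
    """Multiply each channel of a (N, C, H, W) feature map by a learnable scalar."""
    def __init__(self, channels: int, init: float = 1.0):
        super().__init__()
        self.scale = nn.Parameter(torch.full((channels,), init))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.scale.reshape(1, -1, 1, 1)
```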
Based on the linear scaling layer, the re-parameterized structure is modified in the following three steps, as shown in Figure 3:
1. Remove all the non-linear BN layers inside the branches.
2. Add a channel-wise linear scaling layer at the end of each branch to keep the diversity of optimization directions.
3. Add a single BN layer after the outputs of all branches have been summed, to stabilize training.
After block linearization, only linear layers remain inside the re-parameterized structure, which means that all of its components can be merged even at the training stage.
Block squeezing converts the costly operations on intermediate feature maps into faster operations on a single convolution kernel. In terms of computation and storage, the extra training cost of re-parameterization is thereby reduced from O(H×W) to O(K_h×K_w), where H×W is the feature-map size and K_h×K_w is the shape of the convolution kernel.
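A minimal sketch of why this saves work (names and sizes are illustrative): scaling the output feature map of a conv touches O(H×W) values per channel, while scaling the kernel itself touches only O(K_h×K_w) values per filter and yields exactly the same result, because both operations are linear.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 16, 56, 56)             # input feature map
w = torch.randn(32, 16, 3, 3)              # 3x3 conv kernel
s = torch.randn(32)                        # per-channel scaling vector

y_feature = F.conv2d(x, w, padding=1) * s.reshape(1, -1, 1, 1)   # scales 32*56*56 values
y_kernel  = F.conv2d(x, w * s.reshape(-1, 1, 1, 1), padding=1)   # scales only 32*16*3*3 values
print(torch.allclose(y_feature, y_kernel, atol=1e-4))            # True
```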
Generally speaking, no matter how complex the linear re-parameterized structure is, the following two properties always hold:
1. A convolution is linear in its kernel: scaling the kernel scales the output, and adding two kernels (aligned to the same shape) adds the outputs, so parallel branches acting on the same input can be merged by summing their kernels.
2. The composition of two linear layers (e.g. a convolution followed by another convolution or by a scaling layer) is again a single linear layer, so sequential layers can be merged into one convolution kernel.
With the above two properties, a multi-layer (i.e. sequential) structure and a multi-branch (i.e. parallel) structure can each be compressed into a single convolution, as shown in Figures 4a and 4b. The original paper gives the corresponding transformation formulas and proofs; interested readers can refer to those sections, and skipping them does not affect the understanding of block squeezing.
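Since the parallel case was sketched earlier, here is a minimal sketch of the multi-layer case (variable names are assumptions): a 1×1 conv followed by a 3×3 conv collapses into a single 3×3 conv, whose kernel is obtained by a contraction of the two kernels over the middle channel dimension.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 8, 32, 32)
w1 = torch.randn(16, 8, 1, 1)              # 1x1 conv: 8 -> 16 channels
w2 = torch.randn(4, 16, 3, 3)              # 3x3 conv: 16 -> 4 channels

y_seq = F.conv2d(F.conv2d(x, w1), w2, padding=1)

# Merged kernel: w_merged[o, i, :, :] = sum_m w2[o, m, :, :] * w1[m, i, 0, 0]
w_merged = torch.einsum('omhw,mi->oihw', w2, w1[:, :, 0, 0])
y_merged = F.conv2d(x, w_merged, padding=1)
print(torch.allclose(y_seq, y_merged, atol=1e-3))   # True
```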
The paper analyses the role of the multi-branch topology and of block linearization from the perspective of gradient back-propagation, with some formula derivations; interested readers can refer to the corresponding sections of the original paper. The two main conclusions are:
1. Without branch-wise scaling (or BN), parallel branches of the same shape receive identical gradients, so the multi-branch structure degenerates into a single branch.
2. A channel-wise scaling (or BN) layer on each branch diversifies the optimization directions of the branches, which is where the benefit of re-parameterization comes from.
The above conclusions show the importance of the block linearization step: after the BN layers are removed, the scaling layers keep the diversity of optimization directions and prevent the multi-branch structure from degenerating into a single branch.
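A minimal illustration of the degeneration argument (a sketch, not the paper's derivation): without branch-wise scaling, two parallel conv branches receive identical gradients and thus behave like one branch; with different channel-wise scales the gradients differ, keeping the optimization directions diverse.

```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 4, 8, 8)
w1 = torch.randn(4, 4, 3, 3, requires_grad=True)
w2 = torch.randn(4, 4, 3, 3, requires_grad=True)

# No scaling: y = conv(x, w1) + conv(x, w2)
(F.conv2d(x, w1, padding=1) + F.conv2d(x, w2, padding=1)).sum().backward()
print(torch.allclose(w1.grad, w2.grad))        # True: identical gradients, branches degenerate

w1.grad = w2.grad = None
s1 = torch.rand(4).reshape(1, -1, 1, 1)        # distinct per-channel scales per branch
s2 = torch.rand(4).reshape(1, -1, 1, 1)
(s1 * F.conv2d(x, w1, padding=1) + s2 * F.conv2d(x, w2, padding=1)).sum().backward()
print(torch.allclose(w1.grad, w2.grad))        # False: optimization directions now differ
```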
Since OREPA saves a large amount of training cost, it becomes possible to explore more complex training-time structures. Based on DBB, a brand-new re-parameterized module, OREPA-ResNet, is designed with several additional components.
The block design of OREPA-ResNet is shown in Figure 6 (a down-sampling block is drawn here); all branches are finally merged into a single 3×3 convolution, both during training and at inference.
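To show what "online" means in practice, here is a minimal sketch (not the official implementation; the branch choice of 3×3 + 1×1 and all names are assumptions): during training, the branch kernels and their channel-wise scaling vectors are squeezed into a single 3×3 kernel before the convolution, so only one conv ever runs on the feature map.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OnlineRepConv(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.w3 = nn.Parameter(torch.randn(out_ch, in_ch, 3, 3) * 0.1)   # 3x3 branch kernel
        self.w1 = nn.Parameter(torch.randn(out_ch, in_ch, 1, 1) * 0.1)   # 1x1 branch kernel
        self.s3 = nn.Parameter(torch.ones(out_ch))                        # branch scaling vectors
        self.s1 = nn.Parameter(torch.ones(out_ch))
        self.post_bn = nn.BatchNorm2d(out_ch)                             # single BN after the merge

    def weight(self) -> torch.Tensor:
        # Squeeze both scaled branches into one 3x3 kernel (cheap: operates on kernels only).
        w = self.w3 * self.s3.reshape(-1, 1, 1, 1)
        w = w + F.pad(self.w1 * self.s1.reshape(-1, 1, 1, 1), [1, 1, 1, 1])
        return w

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # A single convolution per forward pass, in training and at inference alike.
        return self.post_bn(F.conv2d(x, self.weight(), padding=1))

block = OnlineRepConv(16, 32)
y = block(torch.randn(2, 16, 32, 32))
print(y.shape)   # torch.Size([2, 32, 32, 32])
```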
Ablation experiments on each proposed component.
The effect of the scaling layer on the similarity between the branches of each layer.
Comparison of different linear scaling strategies; channel-wise scaling works best.
Training-time comparison between online and offline re-parameterization.
Comparison with other re-parameterization strategies.
Comparison on detection and segmentation tasks.
This paper proposes OREPA, an online re-parameterization method that squeezes the complex training-time structure into a single convolution layer, greatly reducing training time. To achieve this, a linear scaling layer is used instead of the BN layer during training, which keeps the diversity of optimization directions and the feature-representation ability. The experimental results show that OREPA achieves good accuracy and efficiency on a variety of tasks.