BatchNorm is a core computational component in deep learning and is used in most SOTA image models. Its main advantages are:
1. It downscales the residual branch of ResNets, stabilizing signal propagation in very deep networks.
2. It eliminates the mean shift introduced by activations such as ReLU.
3. It has a regularizing effect.
4. It enables efficient large-batch training by smoothing the loss landscape and allowing larger learning rates.
However, BatchNorm also has notable shortcomings:
1. It is a surprisingly expensive operation, adding memory overhead and noticeably increasing the time needed to compute gradients in some networks.
2. It introduces a discrepancy between training-time and inference-time behavior, adding hidden hyperparameters that must be tuned.
3. It breaks the independence between training examples in a minibatch, which causes subtle bugs in distributed training, can leak information in some tasks, and makes performance sensitive to batch size; accuracy degrades sharply when the batch is small.
Many studies have looked for normalization layers to replace BatchNorm, but these replacements either perform worse or bring new problems, such as extra computational cost at inference. Other studies try to remove normalization entirely, for example by initializing the weights of the residual branch so that its output is zero, ensuring that most of the signal flows through the skip path early in training. Although this makes very deep networks trainable, networks relying on such simple initialization tricks achieve poor accuracy, and the tricks are hard to extend to more complex architectures.
Therefore, this paper looks for an effective way to train deep residual networks without batch normalization while matching current SOTA test-set performance. The main contributions are:
1. Signal Propagation Plots (SPPs), a simple visualization of how signals propagate through the forward pass of deep residual networks.
2. Normalizer-Free ResNets (NF-ResNets), which combine a variance-controlling residual block with Scaled Weight Standardization and match the test accuracy of batch-normalized ResNets on ImageNet.
3. NF-ResNets outperform their batch-normalized counterparts when the batch size is small, and normalizer-free RegNet variants (NF-RegNets) reach accuracies close to EfficientNets.
Many studies have theoretically analyzed signal propagation in ResNets, but few actually measure the feature scales at different layers when designing or modifying networks. In fact, simply running a forward pass on arbitrary input and recording the statistics of the features at different positions in the network gives an intuitive picture of how information propagates and exposes hidden problems early, without long and failing training runs. This paper therefore proposes Signal Propagation Plots (SPPs): feed random Gaussian input (or real training samples) into the network and record the following statistics at the output of each residual block: the average squared channel mean, the average channel variance, and the average channel variance at the end of the residual branch (before it merges with the skip path).
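As an illustration, here is a minimal PyTorch sketch (not the paper's code) of how the first two statistics could be collected with forward hooks; `model` and `residual_blocks` are placeholders for the network being inspected, and the branch-end variance would need an additional hook inside each block.

```python
import torch

def spp_stats(x):
    """SPP statistics for a block output of shape (N, C, H, W): the average
    (over channels) of the squared per-channel mean, and the average
    per-channel variance, both computed over the N, H, W axes."""
    mean = x.mean(dim=(0, 2, 3))
    var = x.var(dim=(0, 2, 3), unbiased=False)
    return (mean ** 2).mean().item(), var.mean().item()

def collect_spps(model, residual_blocks, input_shape=(16, 3, 224, 224)):
    """Feed unit-Gaussian noise through `model` and record SPP statistics at
    the output of every module in `residual_blocks` via forward hooks."""
    stats = []
    hooks = [block.register_forward_hook(
                 lambda module, inputs, output: stats.append(spp_stats(output)))
             for block in residual_blocks]
    with torch.no_grad():
        model(torch.randn(input_shape))
    for h in hooks:
        h.remove()
    return stats  # one (avg squared channel mean, avg channel variance) pair per block
```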
The paper measures SPPs for the common BN-ReLU-Conv ordering and the less common ReLU-BN-Conv ordering. The test network is a 600-layer ResNet with He initialization, and the residual block is defined as $x_{\ell+1} = x_\ell + f_\ell(x_\ell)$. The SPPs reveal the following behavior:
1. The average channel variance grows linearly with depth within a stage and is reset to a small value at each transition block.
2. The average squared channel mean behaves similarly for BN-ReLU-Conv, growing with depth, which indicates a mean shift; for ReLU-BN-Conv it stays close to zero.
3. The variance at the end of the residual branch remains roughly constant regardless of depth, because BatchNorm downscales the branch input in proportion to the accumulated standard deviation of the block input.
If BatchNorm is simply removed, both the average squared channel mean and the average channel variance grow without bound as depth increases, which is exactly why deep unnormalized networks are hard to train. To get rid of BatchNorm, we must therefore reproduce its effect on signal propagation by other means.
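A toy simulation (my own, not from the paper) makes the point: in a plain residual stack $x_{\ell+1} = x_\ell + f_\ell(x_\ell)$ whose branches roughly preserve variance, both statistics grow rapidly with depth.

```python
import numpy as np

# Toy unnormalized residual stack: each branch is ReLU followed by a
# He-initialized linear map, so it roughly preserves variance, and the
# block output statistics keep compounding with depth.
rng = np.random.default_rng(0)
n, samples = 256, 4096
x = rng.normal(size=(n, samples))
for layer in range(20):
    W = rng.normal(0.0, (2.0 / n) ** 0.5, size=(n, n))
    x = x + W @ np.maximum(x, 0)                # x_{l+1} = x_l + f_l(x_l)
    sq_mean = (x.mean(axis=1) ** 2).mean()      # average squared channel mean
    var = x.var(axis=1).mean()                  # average channel variance
    print(layer, round(float(sq_mean), 2), round(float(var), 2))  # both blow up with depth
```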
Guided by the SPPs above, the paper designs a new residual block that mimics BatchNorm's effect on the mean and variance. The block has the form $x_{\ell+1} = x_\ell + \alpha f_\ell(x_\ell / \beta_\ell)$, where $f_\ell$ is the residual branch, parameterized to be variance-preserving at initialization ($\mathrm{Var}(f_\ell(z)) = \mathrm{Var}(z)$), $\alpha$ is a scalar (e.g. 0.2) controlling how fast the variance grows, and $\beta_\ell = \sqrt{\mathrm{Var}(x_\ell)}$ is the expected standard deviation of the block input.
With this design, given $\mathrm{Var}(x_0) = 1$ and $\alpha$, the variance at the output of the $\ell$-th residual block can be computed analytically as $\mathrm{Var}(x_\ell) = \mathrm{Var}(x_{\ell-1}) + \alpha^2$, i.e. $\mathrm{Var}(x_\ell) = 1 + \ell\alpha^2$. To mimic the way the accumulated variance in ResNets is reset at transition blocks, the skip path of each transition block also operates on the downscaled input $x_\ell/\beta_\ell$, which guarantees that the output variance at the start of each stage is reset to $1 + \alpha^2$. Applying this simple scaling strategy to the rest of the network and removing all BatchNorm layers yields the Normalizer-Free ResNet (NF-ResNet).
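The following PyTorch sketch (a simplified reading of the block described above, not the official implementation) shows the block structure and the analytic variance bookkeeping; `branch` and `shortcut` are placeholder modules assumed to be variance-preserving at initialization.

```python
import torch.nn as nn

class NFResidualBlock(nn.Module):
    """x_{l+1} = x_l + alpha * f_l(x_l / beta_l), with beta_l = sqrt(Var(x_l))
    computed analytically rather than from batch statistics."""
    def __init__(self, branch, alpha, expected_var, transition=False, shortcut=None):
        super().__init__()
        self.branch = branch          # residual branch f_l (placeholder module)
        self.shortcut = shortcut      # e.g. a strided conv, used in transition blocks
        self.alpha = alpha
        self.beta = expected_var ** 0.5
        self.transition = transition

    def forward(self, x):
        out = x / self.beta           # the branch always sees (roughly) unit variance
        if self.transition:
            # The skip path also uses the downscaled input, so the output
            # variance resets to 1 + alpha^2 at the start of the stage.
            return self.shortcut(out) + self.alpha * self.branch(out)
        return x + self.alpha * self.branch(out)

def expected_variances(blocks_per_stage, alpha=0.2):
    """Expected input variance (beta_l ** 2) of every block, assuming the stem
    outputs unit variance and the first block of each later stage is a
    transition block."""
    variances, var = [], 1.0
    for stage, n_blocks in enumerate(blocks_per_stage):
        for i in range(n_blocks):
            variances.append(var)
            transition = (stage > 0 and i == 0)
            var = (1.0 + alpha ** 2) if transition else (var + alpha ** 2)
    return variances
```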
The paper then analyzes the He-initialized NF-ResNet with SPPs. The results, shown in Figure 2, reveal two unexpected phenomena:
1. The average squared channel mean grows rapidly with depth and eventually exceeds the average channel variance, indicating a large mean shift.
2. The empirical variance at the end of the residual branch is consistently smaller than 1.
To verify this, the paper removes the ReLUs from the network and re-runs the SPP analysis. As shown in Figure 7, once ReLU is removed the average squared channel mean stays close to 0 and the output variance of the residual branch stays close to 1, confirming that ReLU causes the mean shift.
The paper also analyzes this phenomenon theoretically. Consider the transformation $z = Wg(x)$, where $W$ is an arbitrary fixed matrix and $g(\cdot)$ is an element-wise activation function acting on i.i.d. input $x$, so that $g(x)$ is also i.i.d. Assuming each dimension of $g(x)$ has mean $\mu_g$ and variance $\sigma_g^2$, the mean and variance of any output unit $z_i = \sum_{j=1}^{N} W_{i,j}\, g(x_j)$ are:

$$\mathbb{E}(z_i) = N\,\mu_g\,\mu_{W_{i,\cdot}},\qquad \mathrm{Var}(z_i) = N\,\sigma_g^2\left(\sigma_{W_{i,\cdot}}^2 + \mu_{W_{i,\cdot}}^2\right), \tag{1}$$

where $\mu_{W_{i,\cdot}}$ and $\sigma_{W_{i,\cdot}}^2$ are the mean and variance of the $i$-th row of $W$ (the fan-in of unit $i$):

$$\mu_{W_{i,\cdot}} = \frac{1}{N}\sum_{j=1}^{N} W_{i,j},\qquad \sigma_{W_{i,\cdot}}^2 = \frac{1}{N}\sum_{j=1}^{N} W_{i,j}^2 - \mu_{W_{i,\cdot}}^2.$$
When $g$ is ReLU, $g(x) \geq 0$, so the input to the following linear layer has a positive mean, i.e. $\mu_g > 0$. From formula (1), if $\mu_{W_{i,\cdot}} \neq 0$, then $\mathbb{E}(z_i)$ is also non-zero. Note that even when the weights are sampled from a zero-mean distribution, the empirical mean of any particular row of the sampled matrix is almost never exactly zero, so every dimension of the residual-branch output acquires a non-zero mean; as depth increases these mean shifts accumulate, making the network harder and harder to train.
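A quick NumPy check (my own illustration) of this argument: with zero-mean Gaussian weights and ReLU inputs, the per-unit output means are clearly non-zero and are predicted well by $N\mu_g\mu_{W_{i,\cdot}}$.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1024
W = rng.normal(0.0, (2.0 / N) ** 0.5, size=(N, N))  # zero-mean (He-style) init
x = rng.normal(size=(N, 10000))                     # i.i.d. Gaussian inputs
g = np.maximum(x, 0)                                # ReLU, so mu_g = 1/sqrt(2*pi) > 0
z = W @ g

mu_g = g.mean()
predicted = N * W.mean(axis=1) * mu_g               # E(z_i) from formula (1)
empirical = z.mean(axis=1)
print("mean |E(z_i)|:", np.abs(empirical).mean())                               # clearly non-zero
print("corr(predicted, empirical):", np.corrcoef(predicted, empirical)[0, 1])   # close to 1
```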
To eliminate the mean shift and keep the variance of the residual branch unchanged, the paper borrows from Weight Standardization and Centered Weight Normalization and proposes Scaled Weight Standardization (Scaled WS), which re-parameterizes the convolution weights as:

$$\hat{W}_{i,j} = \gamma\,\frac{W_{i,j} - \mu_{W_{i,\cdot}}}{\sigma_{W_{i,\cdot}}\sqrt{N}},$$
where $\mu_{W_{i,\cdot}}$ and $\sigma_{W_{i,\cdot}}$ are the mean and standard deviation over the convolution kernel's fan-in, the weights $W$ are initialized from a Gaussian, and $\gamma$ is a fixed constant. Substituting $\hat{W}$ into formula (1) gives $\mathbb{E}(z_i) = 0$, so the mean shift is eliminated. In addition, the variance becomes $\mathrm{Var}(z_i) = \gamma^2 \sigma_g^2$, and by choosing $\gamma$ according to the activation function used, the variance can be kept unchanged.
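A minimal PyTorch sketch of a convolution with Scaled WS, assuming the standardization is applied on the fly at every forward pass (the class name and `gamma` argument are my own; this is not the paper's reference code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledWSConv2d(nn.Conv2d):
    """Conv2d whose weight is re-parameterized as
    gamma * (W - mean) / (std * sqrt(fan_in)), with mean/std taken over each
    filter's fan-in, as in the Scaled WS formula above."""
    def __init__(self, in_channels, out_channels, kernel_size, gamma=1.0, **kwargs):
        super().__init__(in_channels, out_channels, kernel_size, **kwargs)
        self.gamma = gamma

    def standardized_weight(self):
        w = self.weight
        fan_in = w[0].numel()                                    # (C_in / groups) * kH * kW
        mean = w.mean(dim=(1, 2, 3), keepdim=True)
        var = w.var(dim=(1, 2, 3), unbiased=False, keepdim=True)
        return self.gamma * (w - mean) / (var * fan_in + 1e-10).sqrt()

    def forward(self, x):
        return F.conv2d(x, self.standardized_weight(), self.bias,
                        self.stride, self.padding, self.dilation, self.groups)

# Example: a conv whose input just passed through a ReLU would use the ReLU gain,
# gamma = sqrt(2 / (1 - 1/pi)) ~= 1.7139.
conv = ScaledWSConv2d(64, 128, 3, padding=1, gamma=1.7139)
```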
Scaled WS adds little overhead during training, is independent of the batch, and adds no overhead at inference. Moreover, the computation is identical at training and test time, and it is friendly to distributed training. As the SPP curves in Figure 2 show, NF-ResNet-600 with Scaled WS behaves very similarly to the ReLU-BN-Conv network.
The last ingredient is choosing the value of $\gamma$ so that the variance of the residual-branch output is close to 1 at initialization. This value depends on the nonlinearity used. Assuming the nonlinearity's input $x \sim \mathcal{N}(0, 1)$, the ReLU output $g(x) = \max(x, 0)$ follows a rectified Gaussian distribution with variance $\sigma_g^2 = \frac{1}{2}\left(1 - \frac{1}{\pi}\right)$ (since $\mathbb{E}[g(x)] = 1/\sqrt{2\pi}$ and $\mathbb{E}[g(x)^2] = 1/2$). Because $\mathrm{Var}(z_i) = \gamma^2\sigma_g^2$, setting $\gamma = 1/\sigma_g = \sqrt{2}\big/\sqrt{1 - \frac{1}{\pi}}$ guarantees $\mathrm{Var}(z_i) = 1$. Although real inputs are not exactly Gaussian, this choice still works well in practice.
For more complex nonlinearities such as SiLU (also called Swish), the derivation involves integrals that are awkward or impossible to solve analytically. In that case $\gamma$ can be approximated numerically: sample many $N$-dimensional vectors from a Gaussian, compute the empirical variance of the activation output for each vector, and set $\gamma$ to the reciprocal of the square root of the average of these variances.
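A sketch of this numerical procedure in NumPy (function names are my own); for ReLU it recovers the analytic value $\sqrt{2/(1-1/\pi)} \approx 1.714$ from the previous paragraph:

```python
import numpy as np

def estimate_gamma(activation, dim=1024, samples=4096, seed=0):
    """Estimate the Scaled WS gain for an activation: sample Gaussian vectors,
    measure the empirical variance of the activation output per vector, and
    return 1 / sqrt(average variance) so that gamma**2 * Var(g(x)) ~= 1."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(samples, dim))
    avg_var = activation(x).var(axis=1).mean()
    return 1.0 / np.sqrt(avg_var)

relu = lambda x: np.maximum(x, 0)
silu = lambda x: x / (1.0 + np.exp(-x))     # SiLU / Swish

print(estimate_gamma(relu))   # ~1.71, matching sqrt(2 / (1 - 1/pi))
print(estimate_gamma(silu))
```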
The core of this method is maintaining correct signal propagation, so many common network components need adjustment. As with choosing $\gamma$, the necessary modification can be determined by analysis or by experiment. For example, the output of an SE module is multiplied by gating weights in $[0, 1]$, which attenuates the signal and destabilizes the network. Applying the numerical procedure above to the SE block shows an expected output variance of 0.5, which means the output needs to be multiplied by 2 to restore correct signal propagation.
In practice, sometimes a fairly simple structural modification is enough to preserve good signal propagation, and sometimes the network is robust to the attenuation even without any modification. The paper therefore also explores how far the constraints of the Scaled WS layer can be relaxed while keeping training stable. For example, to restore some expressive power to the Scaled WS convolutions, a learnable scaling factor multiplying the weights and a learnable bias added to the nonlinearity output are introduced. Leaving these learnable parameters unconstrained does not noticeably hurt training stability, and it helps when training networks deeper than about 150 layers. NF-ResNet therefore simply relaxes the constraints and adds these two learnable parameters.
The appendix of the paper gives the detailed implementation; interested readers can refer to it.
To sum up, the core of the Normalizer-Free ResNet consists of the following points:
1. Use the residual block $x_{\ell+1} = x_\ell + \alpha f_\ell(x_\ell/\beta_\ell)$ and track the expected variance analytically, resetting it at transition blocks.
2. Replace BatchNorm's suppression of mean shift with Scaled Weight Standardization on the convolutions.
3. Choose the gain $\gamma$ according to the nonlinearity so that the residual branch is variance-preserving at initialization.
4. Adjust other components that break signal propagation (e.g. scale SE outputs by 2) and optionally add learnable gains and biases.
Compared with other methods, the normalizer-free RegNet variants do not quite match EfficientNet, but they come very close.
In summary, the paper proposes NF-ResNet, which analyzes the actual signal propagation of the network, mimics BatchNorm's behavior in terms of mean and variance propagation, and thereby replaces BatchNorm. The experiments and analysis are thorough and the results are strong. Some initialization methods are correct in theory but deviate in practice; this paper uncovers such issues through empirical analysis, a nice illustration that true knowledge comes from practice.