After lying dormant for nearly two decades, the convolutional neural network has become one of the most important network structures in deep learning. From the original five-layer LeNet, to the 19-layer VGG, and then to Highway Networks and ResNet, which were the first to cross the 100-layer mark, increasing network depth has been one of the main directions in the development of CNNs.
As CNNs grew deeper, the problems of vanishing gradients and model degradation came to the fore, and the widespread use of batch normalization alleviated vanishing gradients to some extent. ResNet and Highway Networks set up bypasses by constructing identity mappings, further reducing vanishing gradients and model degradation. FractalNet places sub-networks of different depths in parallel, gaining depth while keeping gradients flowing. Networks with stochastic depth randomly deactivate some layers during training, which both demonstrates the redundancy in ResNet's depth and alleviates the problems above. Although these frameworks deepen the network through different mechanisms, they all share the same core idea: connecting feature maps across network layers.
Kaiming He made an assumption when proposing ResNet: if a deeper network's additional layers can learn the identity mapping, then a model trained with the deeper network should perform no worse than its shallower counterpart. Put simply, if layers capable of learning the identity mapping are stacked onto a network to form a new network, the worst case is that those added layers converge to exactly the identity mapping after training, leaving the performance of the original network unaffected. Similarly, DenseNet rests on the assumption that feature reuse is a better way to extract features than repeatedly learning redundant ones.
One of the main advantages of ResNet is that gradients can flow through the identity function directly to earlier layers. However, the identity mapping and the output of the nonlinear transformation are combined by addition, which to some extent impedes the information flow in the network.
To further improve the propagation of information, DenseNet proposes the densely connected structure shown in the figure.
As shown in the figure, the input of layer l depends not only on the output of layer l-1, but on the outputs of all preceding layers: x_l = H_l([x_0, x_1, ..., x_{l-1}]), where H_l is the nonlinear transformation of layer l and [...] denotes concatenation of the feature maps.
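To make this connectivity concrete, here is a minimal PyTorch sketch of a single densely connected layer; it is an illustrative assumption rather than the authors' reference code, and the class name DenseLayer and the BN-ReLU-Conv composition of H_l are chosen for illustration:

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """One layer l computing x_l = H_l([x_0, x_1, ..., x_{l-1}])."""
    def __init__(self, in_channels, growth_rate):
        super().__init__()
        # H_l sketched as BN-ReLU-Conv(3x3), outputting `growth_rate` feature maps
        self.h = nn.Sequential(
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, growth_rate, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, prev_features):
        # prev_features is the list [x_0, ..., x_{l-1}];
        # concatenate along the channel dimension (the "cat" operation)
        return self.h(torch.cat(prev_features, dim=1))
```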
Because DenseNet concatenates feature maps from different layers, those feature maps must share the same spatial size, which restricts downsampling within the network. To allow downsampling, the authors divide DenseNet into several dense blocks, as shown in the following figure:
Within a dense block the feature-map size is kept constant, and a transition layer is inserted between dense blocks to perform downsampling. In the authors' experiments, the transition layer consists of BN + a 1×1 convolution + 2×2 average pooling.
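A minimal sketch of such a transition layer, following the composition described above, might look as follows (the class name TransitionLayer and the channel arguments are assumptions for illustration):

```python
import torch.nn as nn

class TransitionLayer(nn.Module):
    """Downsampling between dense blocks: BN + 1x1 conv + 2x2 average pooling."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.bn = nn.BatchNorm2d(in_channels)
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2)  # halves the spatial size

    def forward(self, x):
        return self.pool(self.conv(self.bn(x)))
```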
In a dense block, assuming each nonlinear transformation H outputs k feature maps, the input of layer l then consists of k0 + (l-1) × k feature maps, where k0 is the number of channels of the block's input. Here we can see a major difference between DenseNet and existing networks: DenseNet can make do with very few feature maps as the output of each layer, as shown in the following figure.
The reason is that each layer in a dense block is connected to all preceding layers. If we regard the concatenated features as the global state of the dense block, the training goal of each layer is to determine, given the existing global state, the update that should be added to it. The number k of feature maps output by each layer is therefore called the growth rate; it determines how much new information each layer contributes to the global state. As we will see later, in the authors' experiments only a small k is needed to achieve state-of-the-art performance.
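A small worked example of how the input channel count grows inside a dense block; the values of k0 and k below are illustrative, not taken from the paper:

```python
# Layer l of a dense block receives k0 + (l - 1) * k feature maps as input
k0, k = 16, 12           # input channels of the block and growth rate (illustrative)
for l in range(1, 7):    # first six layers of a hypothetical dense block
    in_channels = k0 + (l - 1) * k
    print(f"layer {l}: input channels = {in_channels}, output channels = {k}")
# layer 1: 16 in, layer 2: 28 in, ..., layer 6: 76 in -- each producing only k = 12 maps
```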
Although DenseNet uses a small k, i.e., few feature maps as the output of each layer, the concatenation of feature maps across layers means that the final feature map can have a very large number of channels, which becomes a burden for the network. The authors use a 1×1 convolution (bottleneck) to reduce the number of channels, and the improved nonlinear transformation becomes BN-ReLU-Conv(1×1)-BN-ReLU-Conv(3×3), which also improves computational efficiency. DenseNet with the bottleneck layer is called DenseNet-B; in the experiments, the 1×1 convolution produces 4k feature maps.
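A minimal sketch of this DenseNet-B bottleneck layer, again an illustrative assumption rather than the authors' code, with the 1×1 convolution assumed to produce 4k feature maps as described above:

```python
import torch
import torch.nn as nn

class BottleneckLayer(nn.Module):
    """DenseNet-B layer: BN-ReLU-Conv(1x1)-BN-ReLU-Conv(3x3)."""
    def __init__(self, in_channels, growth_rate):
        super().__init__()
        inter_channels = 4 * growth_rate  # channel reduction before the 3x3 conv
        self.h = nn.Sequential(
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, inter_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(inter_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(inter_channels, growth_rate, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, prev_features):
        # concatenate all preceding feature maps, then apply the bottleneck transform
        return self.h(torch.cat(prev_features, dim=1))
```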
To further simplify the model, the number of feature maps can also be reduced at the transition layer. If a dense block outputs m feature maps, the transition layer that follows it can be made to output ⌊θm⌋ feature maps, where θ is the compression factor; when θ = 1, the number of feature maps passing through the transition layer remains unchanged.
DenseNet compressed with θ = 0.5 is named DenseNet-C, and DenseNet that combines both the bottleneck layer and compression with θ = 0.5 is named DenseNet-BC.
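A small example of the compression arithmetic at the transition layer; the value of m below is illustrative:

```python
import math

# With m feature maps entering the transition layer and compression factor theta,
# the transition layer outputs floor(theta * m) feature maps.
m, theta = 256, 0.5
out_channels = math.floor(theta * m)   # 128 feature maps after compression
print(out_channels)
```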
Because DenseNet concatenates its inputs, an intuitive consequence is that the feature maps learned by each layer can be used directly by all subsequent layers, making features reusable throughout the network and the model more compact.
The figure above illustrates DenseNet's parameter efficiency. The left plot shows statistics of parameter counts and final performance for various DenseNet configurations; at the same test error, the original DenseNet often uses 2-3 times as many parameters as DenseNet-BC. The middle plot compares DenseNet-BC with ResNet: at the same accuracy, DenseNet-BC needs only about one third of ResNet's parameters. The right plot compares a 1001-layer ResNet with more than 10M parameters against a 100-layer DenseNet-BC with only 0.8M parameters; despite the large gap in parameter count, they converge to comparable test error after about the same number of training epochs.
Another reason for DenseNet's strong performance is that every layer receives supervision not only from the loss at the end of the original network but also through the many bypasses and shortcuts, so the supervision each layer receives is diverse. The benefit of such deep supervision has been confirmed in deeply supervised networks (DSN), where each hidden layer has its own classifier that forces it to learn discriminative features. Unlike DSN, DenseNet has a single loss function, which makes model construction and gradient computation simpler.
DenseNet was designed from the outset as a structure in which a layer can use the feature maps of all preceding layers. To explore this feature reuse, the authors carried out a dedicated experiment: with a DenseNet trained at L = 40 and k = 12, for every convolutional layer within a dense block they computed the average absolute weight assigned to the feature maps of each preceding layer. This average reflects how much a layer makes use of each earlier feature layer, and the figure below shows the heat map drawn from these averages (a sketch of this computation follows the conclusions below):
From the figure, we can draw the following conclusions:
A) Some features extracted from earlier layers can still be directly used by deeper layers.
B) Even the transition layer uses features from all layers of the preceding dense block.
C) The layers in the 2nd and 3rd dense blocks assign very low weights to the outputs of the preceding transition layer, which indicates that the transition layer outputs many redundant features. This also supports the necessity of the compression used in DenseNet-BC.
D) Although the final classification layer uses information from many layers of the last dense block, it tends to favor the feature maps of the last few layers, which suggests that some high-level features are generated only in the final layers of the network.
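As referenced above, here is a sketch of how such per-source-layer averages could be computed from a convolutional layer's weight tensor. It assumes PyTorch, a weight tensor of shape (out_channels, in_channels, kH, kW) whose input channels are the concatenation of earlier layers' outputs, and that the channel bookkeeping (how many channels each earlier layer contributes) is known; the function name and example values are illustrative, not the authors' code:

```python
import torch

def source_layer_usage(weight, source_channel_counts):
    """Average absolute convolution weight assigned to each preceding layer."""
    usages, start = [], 0
    for count in source_channel_counts:
        block = weight[:, start:start + count]      # slice belonging to one source layer
        usages.append(block.abs().mean().item())    # average absolute weight
        start += count
    return usages

# Example: a layer with growth rate 12 reading from the block input (16 channels)
# and two earlier layers (12 channels each).
w = torch.randn(12, 16 + 12 + 12, 3, 3)
print(source_layer_usage(w, [16, 12, 12]))
```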
The authors trained several DenseNet models on several benchmark datasets and compared them with state-of-the-art models (mainly ResNet and its variants):
As the table above shows, DenseNet needs only a small growth rate (12 or 24) to achieve state-of-the-art performance, and DenseNet-BC, which combines the bottleneck layer and compression, uses far fewer parameters than ResNet and its variants; both DenseNet and DenseNet-BC surpass ResNet on the datasets with and without data augmentation.