Notes on "Rethinking the Inception Architecture for Computer Vision"
After reading Inception v1 and v2, it's time for v3. In Sina's video-recommendation work, the network used for extracting video features was Inception v3. But the authors' title doesn't mention v3; instead it starts with "Rethinking".

As of 2018-12-29: 2,777 citations. Not in the same order of magnitude as v1 and v2 at all.

Published on arXiv in December 2015, a few days earlier than ResNet. The first author is once again Christian Szegedy, the author of v1, and Sergey Ioffe, the author of v2, is third author this time. It seems they are good friends and always appear together.

1. Puts forward some general principles for designing network architectures.

2. Explores various schemes for factorizing convolutions.

This paper explores various ways of scaling up the network, with the goal of using the added computation as efficiently as possible through suitably factorized convolutions and aggressive regularization.

Compared with VGG and AlexNet, Inception's computation and parameter counts are greatly reduced, so it can be used in big-data and mobile settings. However, if you simply scale up the architecture, most of the computational advantage may be lost immediately.

This paper introduces some general principles and optimization ideas for scaling up convolutional networks more efficiently.

The authors repeatedly stress that the above are only partial rules of thumb, and actual use depends on the specific situation.

Much of the original gain of GoogLeNet came from its generous use of dimension reduction. This can be viewed as a special case of factorizing convolutions in a computationally efficient way.

Because Inception networks are fully convolutional, each weight corresponds to one multiplication per activation, so any reduction in computational cost also reduces the total number of parameters. This means that with suitable factorization, we end up with more disentangled parameters and faster training.

Larger filters (such as 5x5 or 7x7) tend to be computationally expensive.

Replace one 5x5 convolution with two 3x3 convolutions.
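A quick back-of-the-envelope check of this replacement (a sketch; the channel count C = 64 is a hypothetical choice, and biases are ignored):

```python
# Weight count for one 5x5 conv vs. two stacked 3x3 convs,
# assuming C input channels and C output channels throughout.
def conv_params(k, c_in, c_out):
    """Weights in a single k x k convolution layer (no bias)."""
    return k * k * c_in * c_out

C = 64
big = conv_params(5, C, C)                           # one 5x5 layer: 25*C^2
small = conv_params(3, C, C) + conv_params(3, C, C)  # two 3x3 layers: 18*C^2
print(big, small, 1 - small / big)  # two 3x3 use 28% fewer weights
```

The same ratio (18/25) holds for multiply counts per output position, which is why the paper treats the two as interchangeable in cost terms.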

A controlled experiment confirms the effectiveness of this strategy.

It is natural to ask whether 3x3 can be reduced further, say to 2x2. By comparison, splitting a 3x3 into two 2x2 convolutions saves only 11% of the computation, while using 3x1 and 1x3 saves 33%.

In theory, we can go further and replace any nxn convolution with a 1xn followed by an nx1. In practice, this factorization is found not to work well on early layers, but it is very effective on medium grid sizes (for m×m feature maps with m between 12 and 20).
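The 11% and 33% figures above can be reproduced by counting multiplies per output position (the channel terms cancel when comparing, so they are omitted in this sketch):

```python
# Multiplies per output position for different factorizations of a 3x3 conv.
def mults(*kernels):
    """Sum of h*w over a sequence of (h, w) kernel shapes."""
    return sum(h * w for h, w in kernels)

full_3x3 = mults((3, 3))          # 9
two_2x2  = mults((2, 2), (2, 2))  # 8  -> only ~11% saved
asym     = mults((1, 3), (3, 1))  # 6  -> 33% saved
print(1 - two_2x2 / full_3x3, 1 - asym / full_3x3)
```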

The original motivation for the auxiliary classifiers in Inception-v1 was to combat the vanishing-gradient problem in deep networks and push useful gradients to the shallow layers immediately. However, experiments show that this does not improve convergence early in training; only late in training is the network with auxiliary classifiers slightly better than the one without. This shows that the assumption in Inception-v1 was wrong (slapping your own face, and bravely admitting it rather than fudging — commendable).

Traditionally, convolutional networks shrink feature maps with pooling operations. To avoid a representational bottleneck, the filter dimension is expanded before average or max pooling is applied.

Although the left diagram of Figure 9 reduces the grid size, it violates general principle 1 by introducing a bottleneck too early; the right diagram avoids this, but costs three times as much computation.

Figure 10 gives the solution: introduce two parallel stride-2 blocks, P (pooling) and C (convolution), and concatenate their outputs. This costs less while also avoiding the representational bottleneck.
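A minimal numpy sketch of this idea: one stride-2 pooling branch and one stride-2 convolution branch, concatenated along the channel axis. The 1x1 stride-2 "conv" branch here is a toy stand-in (the paper uses full Inception-style branches), and all shapes are illustrative assumptions:

```python
import numpy as np

def reduce_grid(x, w):
    """x: (c_in, h, w) feature map; w: (c_out, c_in) 1x1 conv weights."""
    c, h, wid = x.shape
    # P branch: 2x2 max pooling with stride 2
    pooled = x[:, : h // 2 * 2, : wid // 2 * 2]
    pooled = pooled.reshape(c, h // 2, 2, wid // 2, 2).max(axis=(2, 4))
    # C branch: 1x1 convolution applied with stride 2 (strided slicing)
    sub = x[:, ::2, ::2][:, : h // 2, : wid // 2]
    conv = np.einsum("oc,chw->ohw", w, sub)
    # Concatenate: the output grid is halved, channels add up
    return np.concatenate([pooled, conv], axis=0)

x = np.random.rand(8, 16, 16)
out = reduce_grid(x, np.random.rand(4, 8))
print(out.shape)  # (12, 8, 8): 8 pooled + 4 convolved channels, half the grid
```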

(Although this is the official definition of v2, everyone seems to treat BN-Inception as v2?)

The initial 7×7 convolution is factorized into three 3×3 convolutions, and Inception blocks of different structures (Figures 5, 6, and 7) are used, for a total of 42 layers, at a computational cost only about 2.5 times that of v1.

A mechanism is proposed to regularize the classifier layer by estimating the marginalized effect of label dropout during training.

The paper analyzes how cross-entropy loss leads to over-confidence, and proposes a mechanism to encourage the model to be less confident.

For a sample with label y, replace the label distribution q(k) = δ_{k,y} with q'(k) = (1 − ε) δ_{k,y} + ε u(k), where u(k) is a prior distribution over labels and ε is the smoothing weight.

In the experiments, a uniform prior u(k) = 1/K is used, so the formula becomes q'(k) = (1 − ε) δ_{k,y} + ε/K.

This change to the ground-truth label distribution is called label-smoothing regularization (LSR).
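A minimal sketch of LSR with a uniform prior; the values (K = 5, ε = 0.1, y = 2) are purely illustrative:

```python
import numpy as np

def smooth_labels(y, num_classes, eps=0.1):
    """q'(k) = (1 - eps) * [k == y] + eps / K, with a uniform prior."""
    q = np.full(num_classes, eps / num_classes)  # eps/K mass on every class
    q[y] += 1.0 - eps                            # remaining mass on the true label
    return q

q = smooth_labels(y=2, num_classes=5, eps=0.1)
print(q)  # [0.02 0.02 0.92 0.02 0.02], still sums to 1
```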

Another explanation of LSR is from the perspective of cross entropy.

The second loss, H(u, p), penalizes the deviation of the predicted label distribution p from the prior u. This deviation can equivalently be captured by the KL divergence, since H(u, p) = D_KL(u‖p) + H(u) and H(u) is fixed.
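A numeric check of this decomposition: the smoothed cross entropy splits into (1 − ε) H(q, p) + ε H(u, p), and the second term equals ε (D_KL(u‖p) + H(u)). The predicted distribution p below is a hypothetical example:

```python
import numpy as np

def H(q, p):
    """Cross entropy H(q, p) = -sum_k q(k) log p(k)."""
    return -np.sum(q * np.log(p))

K, eps, y = 5, 0.1, 2
p = np.array([0.1, 0.2, 0.4, 0.2, 0.1])  # hypothetical model prediction
u = np.full(K, 1.0 / K)                  # uniform prior
onehot = np.eye(K)[y]
q_smooth = (1 - eps) * onehot + eps * u

lhs = H(q_smooth, p)                               # smoothed cross entropy
rhs = (1 - eps) * H(onehot, p) + eps * H(u, p)     # two-loss decomposition
kl_form = eps * (np.sum(u * np.log(u / p)) + H(u, u))  # eps * (KL(u||p) + H(u))
print(lhs, rhs)  # identical; the eps * H(u, p) term pulls p toward u
```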

On ILSVRC 2012, the settings were K = 1000, ε = 0.1, u(k) = 1/1000. This brought about a 0.2% absolute improvement.

TensorFlow is used for training with 50 replicas (this is the first time TF has been used in the Inception series of papers), batch_size = 32, epochs = 100. Early experiments used SGD with momentum, decay = 0.9. But the best model was trained with RMSProp, decay = 0.9. The learning rate is 0.045, decayed every two epochs with an exponential rate of 0.94.
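The reported schedule can be sketched as a staircase exponential decay; the exact step placement is an assumption, since the paper only states the rate and interval:

```python
# lr starts at 0.045 and is multiplied by 0.94 every two epochs.
def learning_rate(epoch, base=0.045, rate=0.94, every=2):
    return base * rate ** (epoch // every)

print([learning_rate(e) for e in (0, 2, 4, 10)])
```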

In addition, gradient clipping (a technique borrowed from RNN training, with the threshold set to 2.0) was found to stabilize training.
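A sketch of clipping by global norm with that threshold: if the gradient norm exceeds 2.0, rescale it to norm 2.0, otherwise leave it unchanged (the paper does not spell out the exact variant, so this is an assumption):

```python
import numpy as np

def clip_by_global_norm(grads, threshold=2.0):
    """Rescale a list of gradient arrays so their global L2 norm <= threshold."""
    norm = np.sqrt(sum(np.sum(g * g) for g in grads))
    if norm <= threshold:
        return grads
    scale = threshold / norm
    return [g * scale for g in grads]

grads = [np.array([3.0, 4.0])]       # global norm 5.0 > 2.0
clipped = clip_by_global_norm(grads)
print(clipped[0])  # rescaled to norm 2.0 -> [1.2 1.6]
```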

Conventional wisdom says that recognition performance can often be improved significantly by using models with higher-resolution receptive fields. But if we only raise the input resolution without adjusting the model further, we end up using a computationally cheaper model to solve a harder task.

The question becomes: how much does higher resolution help if the amount of computation is kept constant?

Although the low-resolution network needs to train longer, its final quality is not much worse. However, naively shrinking the network according to the input resolution usually leads to poor results.

The conclusion of this part: for higher-resolution inputs, use correspondingly more complex models.

In this paper, the combination of all these improvements on top of Inception-v2 is called Inception-v3.

I think there is an error in Table 4: the Top-5 and Top-1 column headers are reversed.

Several design principles for scaling up convolutional networks are provided.

I feel the Inception architecture is too complicated, full of magic numbers, and not as simple and uniform as ResNet. The paper also feels a bit scattered, like a patchwork. Label smoothing alone could probably have been written up as a separate paper. The most useful part is the set of design principles, which should help in understanding later ideas in network design.

On Stack Overflow, there is an example of implementing label smoothing with pandas.

Among the Inception family, the official PyTorch (torchvision) implementation covers only v3, nothing else.

Section 7.5.1 of the "flower book" (Deep Learning), "Injecting Noise at the Output Targets", explains the principle behind label smoothing.