Professor Wu's CNN class: Advanced
The second week covers the advanced part of convolutional neural networks (CNNs), and I learned a lot of new things. After picking up the basics of CNNs, most people move straight on to RNNs and natural language processing, so the many recent developments in CNNs are things we have heard of but never really understood.

In particular, every paper likes to give its model some strange name (for example, of two papers I read yesterday, one was called NTT and the other TMD). If you haven't read the paper and don't know the story behind the name, you just feel lost.

Before working on visual question answering, I had to choose a pretrained network to preprocess the images. As a result I ran into a pile of unfamiliar names such as VGG, ResNet and Inception, followed by numbers like 16, 19 and 101 and versions like V1, V2 and V3. In the end I could only fumble along, follow other people's papers, and pick one blindly.

Without further delay, let's get started.

Before getting to ResNet and Inception Net, it is worth reviewing the successful CNN architectures that came before them, so that we have a clear picture of how the field developed and can understand the advanced networks more quickly.

The first is the grandfather of them all, LeNet-5, which came out about 20 years ago. It was proposed by Yann LeCun, who was still a young researcher back then but is now one of the "four kings" of deep learning. "LeNet" is taken from the first part of his name, and the 5 means it has only five layers; it was used for handwritten digit recognition tasks such as MNIST.

Professor Wu drew the whole structure by hand in the basic course, so it is not very complicated: after two rounds of convolution and pooling, the result is fed straight into a fully connected network.
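
The figure is not reproduced here, but as a rough sketch of that pipeline (two convolution+pooling stages feeding fully connected layers), something like the following PyTorch module works; the layer sizes are illustrative rather than the exact 1998 architecture.

```python
import torch
import torch.nn as nn

# Rough LeNet-5-style sketch: two conv+pool stages, then fully connected layers.
# Channel and feature sizes are illustrative, assuming 28x28 grayscale input.
class LeNetLike(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, padding=2),   # 28x28 -> 28x28
            nn.AvgPool2d(2),                             # -> 14x14
            nn.Tanh(),
            nn.Conv2d(6, 16, kernel_size=5),             # -> 10x10
            nn.AvgPool2d(2),                             # -> 5x5
            nn.Tanh(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),
            nn.Tanh(),
            nn.Linear(120, 84),
            nn.Tanh(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

print(LeNetLike()(torch.randn(1, 1, 28, 28)).shape)  # torch.Size([1, 10])
```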

After LeNet, artificial intelligence entered a long winter and did not develop much for quite some time.

It was not until 2012 that the rising star Alex Krizhevsky proposed AlexNet, which was a huge success in the ImageNet competition, showed everyone the power of deep learning, and set off a wave that has continued ever since.

There is not much difference between AlexNet and LeNet. The main differences are as follows.

After AlexNet demonstrated the success of deep learning on images, everyone began improving CNN architectures for image processing. VGG was the first very clean, systematic design, proposing a set of architectural rules for how to train deeper networks.

The innovation of VGG is that, compared with the rather ad hoc hyperparameters of earlier networks, it lays out the structure of every layer in a planned, orderly way. For example:

VGG is the name of the Visual Geometry Group that proposed it. People now generally use VGG-16 or VGG-19, where 16 and 19 are the number of layers in the corresponding version.

Sixteen or nineteen layers may sound like a lot, but they are a drop in the bucket compared with giants like ResNet, with its 101 or 152 layers.

Then why not increase the number of layers of VGG?

A big problem in training deep neural networks is that when there are too many layers, the so-called exploding gradient and vanishing gradient problems appear.

For example, suppose each layer multiplies the gradient by a factor less than 1, say 0.6. Then 0.6 to the 19th power is already only about 0.00006, never mind hundreds of layers. By the time this propagates down to the lowest layers, the gradient left for updating the parameters is tiny: the gradient has vanished.

Conversely, with exploding gradients, if the factor is greater than 1, repeated multiplication produces a huge number, which is just as harmful. It is a bit like compound interest, or the old story of asking the king for rice doubled on every square of a chessboard.
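
A two-line sanity check of that compounding effect (0.6 and 1.5 are just example factors):

```python
# Compounding effect of multiplying a gradient by the same factor at each of 19 layers.
shrink, grow, layers = 0.6, 1.5, 19
print(shrink ** layers)  # ~6.1e-05 -> the gradient effectively vanishes
print(grow ** layers)    # ~2217    -> the gradient explodes
```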

Exploding gradients are relatively easy to deal with: they can be tamed by gradient clipping. Vanishing gradients are harder to solve.
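
As a minimal sketch of what gradient clipping looks like in practice, assuming a PyTorch model and optimizer (the model, data and max_norm value here are placeholders chosen for illustration):

```python
import torch
import torch.nn as nn

# Placeholder model, optimizer and data purely for illustration.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(32, 10), torch.randn(32, 1)

loss = nn.functional.mse_loss(model(x), y)
optimizer.zero_grad()
loss.backward()
# Rescale gradients so their global norm is at most 1.0, preventing explosions.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```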

Because of this problem, although in theory a plain deep network should get better the deeper it is, in practice the opposite happens. In the figure below, the horizontal axis is the number of layers and the vertical axis is the training error (smaller is better). In theory the error should keep decreasing as layers are added, but in practice, past a certain point, it starts to rise again.

So how does ResNet manage to go hundreds of layers deep? Isn't it afraid of the problems above?

It solves the above problems in a very simple way.

There are many great papers like this. Before reading them I assumed the impressive names meant something intimidating, but when I actually read the papers I found the methods remarkably concise, often simple solutions derived from solid mathematics, which makes you applaud while sighing at your own poor math.

ResNet is the same: the name sounds cool, but when I opened the paper I found it is just this.

The most important thing in ResNet is the concept of the shortcut. Let's see what that is. First, take two layers out of a neural network as a block; it does not matter whether the layers in the middle are MLP or CNN layers.

A shortcut, as shown in the figure below, connects the input of the first layer directly to the output of the second layer, just before that output passes through the activation function.

That is to say, the output of the second layer, taken after activation, goes from

a^[l+2] = g(z^[l+2])

to

a^[l+2] = g(z^[l+2] + a^[l]),

where a^[l] is the input to the block and g is the activation function.

After this change, the small block we get is called a residual block, and stacking these blocks gives us a residual network. It really is that simple; below, for example, is a 34-layer residual network.
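
The figures are not reproduced here, but here is a minimal sketch of a single residual block, assuming the input and output share the same shape so the shortcut can be added directly (real ResNets also use strided or 1x1 shortcuts when shapes change):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A two-layer residual block: the input x skips ahead and is added to the
# block's output before the final activation, i.e. a = g(z + x).
class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)  # shortcut: add x just before the activation

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)  # same shape as the input
```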

Now that we have a residual network, what happens when we actually train it?

It behaves exactly the way we want and matches the theory: the training error keeps going down as more layers are added.

Finally, let's talk about the principle behind the residual network: why does such a simple change give such good performance?

The reason is that the shortcut makes it easy for each residual block to learn the identity function f(x) = x. In other words, after adding a residual block, the network can at worst keep the information unchanged and pass the previous result straight through (if the block's weights shrink toward zero, the output is just g(a^[l]) = a^[l] with a ReLU activation), so we no longer have to worry about the vanishing gradient problem described above.
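
In the course's notation, here is a short sketch of that argument (assuming ReLU activations and the shortcut form shown earlier):

```latex
a^{[l+2]} = g\left(z^{[l+2]} + a^{[l]}\right)
          = g\left(W^{[l+2]} a^{[l+1]} + b^{[l+2]} + a^{[l]}\right)

% If regularization (or training) drives W^{[l+2]} and b^{[l+2]} toward 0:
a^{[l+2]} \approx g\left(a^{[l]}\right) = a^{[l]}
% since ReLU leaves the non-negative activations a^{[l]} unchanged,
% the block reduces to the identity function f(x) = x.
```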

Before introducing Inception space network, first introduce a very important concept, 1x 1 convolution.

At first glance the idea of a 1x1 convolution seems strange. What is the point of convolving a single pixel at a time, when the whole motivation was to detect local features? But once you understand the concept of channels, it makes sense.

Where an ordinary large-window convolution is mainly about how features interact spatially within each channel, a 1x1 convolution is purely a convolution across channels: it strengthens the interaction between channels.

That is the first point: it strengthens communication between channels. With it, you can perform operations purely on the channels, such as using a 1x1 convolution to increase or decrease the number of channels, or simply to remix the existing channels without changing their number.
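
A quick illustration of that channel manipulation, with 192 and 16 chosen purely as example channel counts:

```python
import torch
import torch.nn as nn

# A 1x1 convolution mixes information across channels at each pixel,
# so it can shrink (or grow) the channel dimension without touching H and W.
x = torch.randn(1, 192, 28, 28)             # (batch, channels, height, width)
reduce_channels = nn.Conv2d(192, 16, kernel_size=1)
print(reduce_channels(x).shape)             # torch.Size([1, 16, 28, 28])
```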

Another advantage of the 1x1 convolution is that, applied sensibly, it can reduce the amount of computation in the whole network.

Let's take an example. Suppose we have a convolution step like the one shown below.

The computation required for this step is about 120 million multiplications.

But if we cleverly use a 1x1 convolution to handle the channels first, as below:

The computation becomes only about 12 million multiplications, a tenfold reduction at a stroke.
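
The figure with the exact dimensions is not reproduced here, but assuming the numbers typically used in the course (a 28x28x192 input, a 5x5 convolution producing 32 channels, and a 1x1 bottleneck down to 16 channels), a quick back-of-the-envelope check reproduces the roughly tenfold saving:

```python
# Multiplications for a direct 5x5 convolution: 28x28x192 -> 28x28x32
direct = 28 * 28 * 32 * (5 * 5 * 192)
# With a 1x1 bottleneck: 28x28x192 -> 28x28x16 -> 28x28x32
bottleneck = 28 * 28 * 16 * 192 + 28 * 28 * 32 * (5 * 5 * 16)
print(f"{direct:,}")      # 120,422,400  (~120 million)
print(f"{bottleneck:,}")  # 12,443,648   (~12 million)
```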

With all this in hand, things get much easier. The core idea of the Inception network is: since choosing a CNN architecture means agonizing over which filter window size to use, why not use all of the sizes at once and concatenate the results?

That gives us the Inception module below. The 1x1, 3x3 and 5x5 convolutions are all used, along with the trick described above of using 1x1 convolutions to cut down the computation.
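
Here is a simplified sketch of such a module in PyTorch; the branch widths are arbitrary example values, and a pooling branch (also present in the original module) is included alongside the three convolution branches:

```python
import torch
import torch.nn as nn

# Simplified Inception-style module: several filter sizes in parallel,
# 1x1 convolutions as cheap bottlenecks, outputs concatenated on channels.
class InceptionBlock(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, 64, kernel_size=1)
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, 96, kernel_size=1),            # 1x1 bottleneck
            nn.Conv2d(96, 128, kernel_size=3, padding=1),
        )
        self.branch5 = nn.Sequential(
            nn.Conv2d(in_ch, 16, kernel_size=1),            # 1x1 bottleneck
            nn.Conv2d(16, 32, kernel_size=5, padding=2),
        )
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, 32, kernel_size=1),
        )

    def forward(self, x):
        branches = [self.branch1(x), self.branch3(x),
                    self.branch5(x), self.branch_pool(x)]
        return torch.cat(branches, dim=1)  # concatenate along channels

x = torch.randn(1, 192, 28, 28)
print(InceptionBlock(192)(x).shape)  # torch.Size([1, 256, 28, 28])
```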

Finally, the full Inception network, just like the residual network earlier, is simply a stack of these Inception modules.

Of course, there are some small details in the paper that I won't go into here.

Since the original paper, the Inception network has also been improved a lot, absorbing many new tricks, including ideas from the residual network above. That is why you now see suffixes such as V1, V2 and V3 after the name, indicating the different versions.

Finally, as a conclusion, let me talk about why Inception Network is called Inception Network.

Because in this otherwise serious academic paper, the authors cited a link, and clicking it brings up this picture.