Paper reading: ImageNet Classification with Deep Convolutional Neural Networks.
This is the paper that introduced AlexNet.

On ImageNet LSVRC-2010, in the task of classifying 1.2 million high-resolution images into 1000 categories, the network achieved top-1 and top-5 error rates of 37.5% and 17.0% on the test set (top-1 means the single highest-scoring prediction must be the correct category; top-5 means the correct category must be among the five highest-scoring predictions). A variant of the model achieved a top-5 error rate of 15.3% in the ImageNet LSVRC-2012 competition. AlexNet has 60 million parameters and 650,000 neurons, comprising five convolutional layers, some of which are followed by max-pooling layers, and three fully connected layers. To reduce overfitting, dropout is used in the fully connected layers, which is described in more detail below.

The data come from ImageNet. The training set contains 1.2 million images, the validation set contains 50,000 images, and the test set contains 150,000 images. The images fall into 1000 categories and come in many different resolutions, but AlexNet requires a fixed input resolution. To solve this problem, Alex's team down-sampled each image to 256×256: given a rectangular image, they rescaled it so that its shorter side had length 256, and then cropped the central 256×256 patch from the result.
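A minimal sketch of this preprocessing step. The library choice (Pillow) and the resampling filter are assumptions of mine; the paper only specifies the rescale-then-center-crop procedure.

```python
from PIL import Image

def preprocess(path, side=256):
    """Rescale so the shorter side equals `side`, then center-crop a side x side patch."""
    img = Image.open(path).convert("RGB")
    w, h = img.size
    scale = side / min(w, h)                        # shorter side -> 256
    img = img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
    w, h = img.size
    left, top = (w - side) // 2, (h - side) // 2    # central 256x256 crop
    return img.crop((left, top, left + side, top + side))
```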

At the time, the standard neuron activation function was tanh, and this saturating nonlinearity trains much more slowly under gradient descent than a non-saturating one. AlexNet therefore uses the ReLU function as its activation. Figure 1 shows that a four-layer convolutional network with ReLUs reaches a 25% training error rate on the CIFAR-10 dataset six times faster than the same network with tanh under the same conditions.
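For reference, the two activations being compared, written as a plain NumPy sketch:

```python
import numpy as np

def tanh(x):
    return np.tanh(x)           # saturating: gradients vanish for large |x|

def relu(x):
    return np.maximum(0.0, x)   # non-saturating for x > 0, which speeds up gradient descent
```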

AlexNet is trained in parallel on two GTX 580 GPUs with 3GB of memory each, placing half of the kernels (neurons) on each GPU; the GPUs communicate only at specific layers.

Unlike tanh and sigmoid, the ReLU function does not have a bounded output range, so AlexNet applies a normalization after the ReLU. The idea of LRN (local response normalization) comes from the concept of "lateral inhibition" in neurobiology: an activated neuron suppresses its neighboring neurons. The calculation is

$$ b^i_{x,y} = a^i_{x,y} \Big/ \left( k + \alpha \sum_{j=\max(0,\ i-n/2)}^{\min(N-1,\ i+n/2)} \big( a^j_{x,y} \big)^2 \right)^{\beta} $$

where:

a^i_{x,y} is the activation of the neuron at position (x, y) computed by the i-th convolution kernel and then passed through ReLU.

b^i_{x,y} is the normalized value.

n is the number of kernel maps adjacent to kernel i over which the sum runs; it is a hyperparameter and is set to 5 in the paper.

N is the total number of convolution kernels.

k = 2, α = 10⁻⁴, and β = 0.75.
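A NumPy sketch of this normalization over a stack of feature maps. The (kernels, height, width) array layout is an assumption of mine; the constants follow the paper.

```python
import numpy as np

def local_response_norm(a, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """a: ReLU activations with shape (N_kernels, H, W); returns b with the same shape."""
    N = a.shape[0]
    b = np.empty_like(a)
    for i in range(N):
        lo, hi = max(0, i - n // 2), min(N - 1, i + n // 2)
        s = np.sum(a[lo:hi + 1] ** 2, axis=0)        # sum over adjacent kernel maps
        b[i] = a[i] / (k + alpha * s) ** beta
    return b
```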

Overlapping pooling means that adjacent pooling windows overlap. More precisely, a pooling layer can be regarded as a grid of pooling units spaced s pixels apart, each summarizing a z × z neighborhood centered at the pooling unit's position; that is, the pooling size is z and the stride is s. Setting s = z gives traditional non-overlapping pooling, while setting s < z gives overlapping pooling, which AlexNet uses with s = 2 and z = 3.
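A NumPy sketch of overlapping max pooling with z = 3 and s = 2 on a single 2-D feature map (boundary handling is simplified to "no padding"; that detail is my assumption):

```python
import numpy as np

def max_pool(x, z=3, s=2):
    """Overlapping max pooling on a 2-D feature map x; windows overlap because s < z."""
    H, W = x.shape
    out_h, out_w = (H - z) // s + 1, (W - z) // s + 1
    out = np.empty((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i * s:i * s + z, j * s:j * s + z].max()
    return out
```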

The output of the last fully connected layer (Full8) is fed to a softmax layer with 1000 units, which produces the predictions over the 1000 labels. Response-normalization layers follow the first and second convolutional layers, and max-pooling layers follow the response-normalization layers as well as the fifth convolutional layer. The ReLU activation is applied to the output of every convolutional and fully connected layer.
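A rough PyTorch sketch of this layer ordering. The kernel counts and sizes follow the paper, but PyTorch itself, the single-GPU layout, and the small padding on the first convolution (added so a 224×224 input reaches the 6×6 feature map, as in common reimplementations) are substitutions of mine, not the original two-GPU implementation.

```python
import torch
import torch.nn as nn

class AlexNet(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4, padding=2), nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
            nn.MaxPool2d(kernel_size=3, stride=2),                   # overlapping pooling
            nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
            nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),                            # fed to softmax
        )

    def forward(self, x):
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))
```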

The most common way to reduce overfitting on image data is to artificially enlarge the dataset. AlexNet uses two forms of data augmentation:

First, horizontal reflections and random cropping.

Each image is mirrored horizontally, and 224×224 patches are randomly extracted from the original and mirrored 256×256 images. This increases the size of the training set by a factor of 2048, although the resulting training samples are highly interdependent. Without this scheme the network overfits severely, which would have forced the use of a much smaller network. At test time, AlexNet extracts five patches (the four corners and the center) and their horizontal reflections, 10 patches in total, and the prediction is the average of the softmax outputs over these 10 patches.
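A NumPy sketch of the ten-crop evaluation. Here `model` is a placeholder for any callable that maps a patch to softmax probabilities; that interface is my assumption.

```python
import numpy as np

def ten_crop_predict(model, img, crop=224):
    """img: (H, W, 3) array with H = W = 256; average softmax outputs over 10 patches."""
    H, W, _ = img.shape
    offsets = [(0, 0), (0, W - crop), (H - crop, 0),
               (H - crop, W - crop), ((H - crop) // 2, (W - crop) // 2)]
    probs = []
    for top, left in offsets:
        patch = img[top:top + crop, left:left + crop]
        for p in (patch, patch[:, ::-1]):           # the patch and its mirror image
            probs.append(model(p))                  # model returns softmax probabilities
    return np.mean(probs, axis=0)
```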

Second, altering the intensities of the RGB channels in training images.

PCA (principal component analysis) is performed on the set of RGB pixel values of the whole ImageNet training set. To each training image, multiples of the found principal components are added, with magnitudes proportional to the corresponding eigenvalues times a random variable drawn from a Gaussian distribution with mean 0 and standard deviation 0.1. Specifically, the quantity $[\mathbf{p}_1, \mathbf{p}_2, \mathbf{p}_3][\alpha_1\lambda_1, \alpha_2\lambda_2, \alpha_3\lambda_3]^T$ is added to each RGB pixel $I_{xy} = [I^R_{xy}, I^G_{xy}, I^B_{xy}]^T$.

Here pi and λi are the i-th eigenvector and eigenvalue of the 3 × 3 covariance matrix of RGB pixel values, and αi is the random variable mentioned above. Each αi is drawn only once for all the pixels of a particular training image, and is not redrawn until that image is used for training again. This scheme approximately captures an important property of natural images: object identity is invariant to changes in the intensity and color of the illumination.
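A NumPy sketch of this "PCA color" perturbation. The paper computes the eigen-decomposition over all RGB pixels of the training set; the function names and the float pixel representation are mine.

```python
import numpy as np

def rgb_pca(pixels):
    """pixels: (N, 3) array of RGB values gathered from the training set."""
    cov = np.cov(pixels, rowvar=False)               # 3 x 3 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)           # columns of eigvecs are p_i
    return eigvecs, eigvals

def pca_color_augment(img, eigvecs, eigvals, sigma=0.1, rng=np.random):
    """img: (H, W, 3) float array; add the same PCA-based offset to every pixel."""
    alpha = rng.normal(0.0, sigma, size=3)           # drawn once per image per use
    delta = eigvecs @ (alpha * eigvals)              # [p1 p2 p3][a1*l1, a2*l2, a3*l3]^T
    return img + delta
```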

AlexNet sets the dropout probability to 0.5. At test time all neurons are used, but their outputs are multiplied by 0.5.
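A NumPy sketch of this train/test behaviour, following the paper's formulation (scale the outputs at test time, rather than the "inverted" dropout used by most modern libraries):

```python
import numpy as np

def dropout(x, p=0.5, train=True, rng=np.random):
    if train:
        mask = rng.random(x.shape) >= p      # drop each unit with probability p
        return x * mask
    return x * (1.0 - p)                     # at test time keep all units but scale outputs
```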

AlexNet is trained with stochastic gradient descent using a batch size of 128, a momentum of 0.9, and a weight decay of 0.0005. The weight decay here is not merely a regularizer: it also reduces the model's training error. The weight update rule is

$$ v_{i+1} = 0.9\,v_i - 0.0005\,\varepsilon\,w_i - \varepsilon\left\langle \frac{\partial L}{\partial w}\Big|_{w_i} \right\rangle_{D_i}, \qquad w_{i+1} = w_i + v_{i+1} $$

where i is the iteration index, v is the momentum variable, ε is the learning rate, and the last term is the average over the i-th batch D_i of the derivative of the objective with respect to w, evaluated at w_i.
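The same update written as a short Python sketch; `g` stands for the batch-averaged gradient, and the function name is mine.

```python
def sgd_step(w, v, g, lr, momentum=0.9, weight_decay=0.0005):
    """One AlexNet-style update: v <- 0.9*v - wd*lr*w - lr*g, then w <- w + v."""
    v = momentum * v - weight_decay * lr * w - lr * g
    return w + v, v
```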

In addition, the weights of every layer in AlexNet are initialized from a Gaussian distribution with mean 0 and standard deviation 0.01, and the biases of the 2nd, 4th, and 5th convolutional layers and of the fully connected hidden layers are initialized to 1. This gives the ReLUs positive inputs and accelerates the early stages of learning. The biases of the remaining layers are initialized to 0.
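A PyTorch sketch of this initialization scheme, applied to the architecture sketch above; the bookkeeping that identifies the 2nd, 4th, and 5th convolutional layers by their order is my assumption.

```python
import torch.nn as nn

def init_alexnet(model):
    """Gaussian(0, 0.01) weights; bias 1 for conv 2/4/5 and hidden FC layers, 0 elsewhere."""
    convs = [m for m in model.modules() if isinstance(m, nn.Conv2d)]
    linears = [m for m in model.modules() if isinstance(m, nn.Linear)]
    for i, m in enumerate(convs, start=1):
        nn.init.normal_(m.weight, mean=0.0, std=0.01)
        nn.init.constant_(m.bias, 1.0 if i in (2, 4, 5) else 0.0)
    for m in linears[:-1]:                            # fully connected hidden layers
        nn.init.normal_(m.weight, mean=0.0, std=0.01)
        nn.init.constant_(m.bias, 1.0)
    nn.init.normal_(linears[-1].weight, mean=0.0, std=0.01)
    nn.init.constant_(linears[-1].bias, 0.0)          # output layer bias starts at 0
```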