, CIFAR-10/100. Their capacity can be controlled by varying their depth and width, and they also make strong and mostly correct assumptions about the nature of images (namely, stationarity of statistics and locality of pixel dependencies).

Therefore, compared with standard feedforward neural networks with similarly sized layers, CNNs have far fewer connections and parameters, so they are easier to train, while their theoretically best performance is likely to be only slightly worse.
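As a rough, hypothetical illustration of this point (the layer sizes below are assumptions chosen for the example, not figures taken from this text), the following sketch compares the parameter count of a convolutional layer with that of a fully connected layer producing an output of the same size:

```python
# Rough parameter-count comparison (hypothetical layer sizes): a convolutional
# layer reuses the same small kernel at every spatial position, while a fully
# connected layer needs one weight per input-output pair.

in_h, in_w, in_ch = 224, 224, 3      # input feature map (assumed)
out_h, out_w, out_ch = 55, 55, 96    # output feature map (assumed)
kernel = 11                          # square kernel size (assumed)

conv_params = kernel * kernel * in_ch * out_ch + out_ch  # shared weights + biases
fc_params = (in_h * in_w * in_ch) * (out_h * out_w * out_ch) + out_h * out_w * out_ch

print(f"convolutional layer:   {conv_params:,} parameters")  # ~35 thousand
print(f"fully connected layer: {fc_params:,} parameters")    # ~44 billion
```

The weight sharing of the convolution is what keeps its parameter count in the tens of thousands rather than the tens of billions.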

Despite the attractive qualities of CNNs, and despite the relative efficiency of their local architecture, they have still been prohibitively expensive to apply at large scale to high-resolution images. Fortunately, current GPUs, paired with a highly optimized implementation of 2D convolution, are powerful enough to make training interestingly large CNNs practical, and recent datasets such as ImageNet contain enough labeled examples to train such models without severe overfitting.

The specific contributions of this paper are as follows:

Finally, the network's size is limited mainly by the amount of memory available on current GPUs and by the amount of training time we are willing to tolerate. Our network takes five to six days to train on two GTX 580 3GB GPUs. All of our experiments suggest that our results can be improved simply by waiting for faster GPUs and bigger datasets to become available.

2 The Dataset

ImageNet is a dataset of over 15 million high-resolution images belonging to roughly 22,000 categories. The images were collected from the web and labeled by human labelers using Amazon's Mechanical Turk crowdsourcing tool. Starting in 2010, as part of the Pascal Visual Object Challenge, an annual competition called the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) has been held. ILSVRC uses a subset of ImageNet with roughly 1,000 images in each category. In all, there are roughly 1.2 million training images, 50,000 validation images, and 150,000 test images.

ILSVRC-2010 is the only version of ILSVRC for which the test set labels are available, so this is the version on which we performed most of our experiments. Since we also entered our model in the ILSVRC-2012 competition, in Section 6 we also report our results on that version of the dataset, for which the test set labels are unavailable. On ImageNet, two error rates are customarily reported: top-1 and top-5, where the top-5 error rate is the fraction of test images for which the correct label is not among the five labels the model considers most probable.
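As a concrete reading of these two metrics, here is a minimal NumPy sketch (the score and label arrays are hypothetical toy data, not results from this paper) that computes top-1 and top-5 error rates from a matrix of per-class model scores:

```python
import numpy as np

def top_k_error(scores, labels, k):
    """Fraction of examples whose correct label is not among the k highest-scoring classes.

    scores: (n_examples, n_classes) array of model outputs; labels: (n_examples,) true class ids.
    """
    # indices of the k largest scores per example (order among them does not matter)
    top_k = np.argpartition(scores, -k, axis=1)[:, -k:]
    hit = (top_k == labels[:, None]).any(axis=1)
    return 1.0 - hit.mean()

# toy usage with random scores for 4 examples over 10 classes (illustrative only)
rng = np.random.default_rng(0)
scores = rng.standard_normal((4, 10))
labels = np.array([3, 1, 7, 0])
print("top-1 error:", top_k_error(scores, labels, 1))
print("top-5 error:", top_k_error(scores, labels, 5))
```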

ImageNet consists of variable-resolution images, while our system requires a constant input dimensionality.

Therefore, we downsampled the images to a fixed resolution of 256 × 256. Given a rectangular image, we first rescaled it so that its shorter side had length 256, and then cropped the central 256 × 256 patch from the resulting image. We did not preprocess the images in any other way, except for subtracting the mean activity over the training set from each pixel. We therefore trained the network on the (centered) raw RGB values of the pixels.
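A minimal sketch of this preprocessing, assuming Pillow and NumPy are available; `mean_image` stands for a precomputed per-pixel mean over the training set and is a hypothetical variable, not something defined in the text:

```python
import numpy as np
from PIL import Image

def preprocess(path, mean_image):
    """Resize so the shorter side is 256, center-crop 256x256, subtract the per-pixel mean.

    mean_image: (256, 256, 3) array of per-pixel mean RGB values over the training set.
    """
    img = Image.open(path).convert("RGB")
    w, h = img.size
    scale = 256 / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)

    # crop the central 256 x 256 patch
    w, h = img.size
    left, top = (w - 256) // 2, (h - 256) // 2
    img = img.crop((left, top, left + 256, top + 256))

    return np.asarray(img, dtype=np.float32) - mean_image  # centered raw RGB values
```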

3 Architecture

3.1 ReLU Nonlinearity

3.2 Training on Multiple GPUs

3.3 Local Response Normalization

3.4 Overlapping Pooling

Pooling layers in CNNs summarize the outputs of neighboring groups of neurons in the same kernel map. Traditionally, the neighborhoods summarized by adjacent pooling units do not overlap (for example).

4.2 Dropout

Combining the predictions of many different models is a very effective way to reduce test error, but it appears too expensive for large neural networks that already take several days to train. There is, however, a very efficient version of model combination that costs only about a factor of two during training. The recently introduced technique called dropout [10] sets the output of each hidden neuron to zero with probability 0.5. Neurons that are dropped out in this way contribute neither to the forward pass nor to backpropagation. So every time an input is presented, the neural network samples a different architecture, but all these architectures share weights. This technique reduces complex co-adaptations between neurons, since a neuron cannot rely on the presence of particular other neurons. It is therefore forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons. At test time, we use all the neurons but multiply their outputs by 0.5, which is a reasonable approximation to taking the geometric mean of the predictive distributions produced by the exponentially many dropout networks.

As shown in Figure 2, we use dropout in the first two fully connected layers. Without dropout, our network exhibits substantial overfitting. Dropout roughly doubles the number of iterations required to converge.
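The following is a minimal NumPy sketch of the dropout forward pass as described above (train-time masking with probability 0.5 and test-time scaling by 0.5); the function name and toy activation vector are illustrative assumptions, and the backward pass is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(activations, train, p=0.5):
    """Dropout as described above: zero each hidden unit with probability p during
    training; at test time keep every unit but scale its output by (1 - p)."""
    if train:
        mask = rng.random(activations.shape) >= p  # keep a unit with probability 1 - p
        return activations * mask                  # dropped units send nothing forward
    return activations * (1.0 - p)                 # approximates the geometric-mean ensemble

# toy usage on a hypothetical hidden-layer activation vector
hidden = np.array([0.2, 1.5, 0.0, 3.1, 0.7])
print(dropout_forward(hidden, train=True))   # a random subset is zeroed
print(dropout_forward(hidden, train=False))  # all units, scaled by 0.5
```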

5 Details of Learning

7 Discussion