Detailed explanation of SPP-net article
The article "spatial pyramid pool in deep convection network for visual recognition" is an improvement of RCNN (see R-CNN article for a detailed explanation of RCNN). First of all, the overall framework of SPP-net is as follows.

The paper makes two main improvements:

1. A CNN requires a fixed-size input image, and the cropping or warping needed to meet that requirement leads to an unnecessary loss of accuracy.

2. R-CNN runs the convolution computation repeatedly, once per candidate region, which leads to computational redundancy.

1. Why does a CNN require a fixed input image size?

A CNN is mainly composed of convolution layers and fully connected layers. The convolution layers operate in a sliding-window manner and output feature maps that represent the spatial distribution of the activations. In fact, the convolution layers do not need a fixed-size input image and can generate feature maps of any size. The fully connected layers, on the other hand, require an input of fixed size/length. Therefore, the fixed-input-size constraint comes only from the fully connected layers, which sit at the deeper stage of the network.
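A minimal PyTorch sketch of this point (my own illustration, not code from the paper): the same convolution layer produces a feature map for any input size, but a fully connected layer only accepts one fixed flattened length.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1)
fc = nn.Linear(64 * 224 * 224, 10)    # flattened length tied to a 224x224 input

for size in (224, 180, 300):
    x = torch.randn(1, 3, size, size)
    fmap = conv(x)                     # the convolution works for every size
    flat = fmap.flatten(1)             # length = 64 * size * size
    fits = flat.shape[1] == fc.in_features
    print(f"input {size}x{size}: feature map {tuple(fmap.shape[2:])}, "
          f"flattened length {flat.shape[1]}, fits the fc layer: {fits}")
```

Only the 224x224 input matches the fully connected layer's expected length; the other sizes pass through the convolution just fine but cannot be fed to the fc layer without cropping, warping, or something like the SPP layer.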

Solutions (compare R-CNN and SPP-net):

As shown in the following figure

The first row of the figure shows the image preprocessing (cropping or warping) used when a CNN requires a fixed-size input.

The second row shows the processing flow of a CNN that requires a fixed-size input (such as R-CNN): the image is first processed as in the first row, then fed through the convolution and fully connected layers, and finally the result is output.

The third row shows the processing flow of SPP-net: the image size is not fixed, and the image is fed directly into the convolution layers. The convolutional features are not passed straight to the fully connected layers; they first go through the SPP layer, which produces a fixed-length output that is then passed to the fully connected layers, and finally the result is output.

2. Why does R-CNN have computational redundancy?

As shown in the following figure

For one image, R-CNN first extracts about 2000 candidate regions with selective search and then feeds each of these 2000 regions through the network separately; that is, one image undergoes about 2000 forward passes, which causes a great deal of redundancy.

SPP-net, on the other hand, proposes a mapping between the candidate regions and the feature map of the whole image. Through this mapping, the feature of each candidate region can be obtained directly, so there is no need to run the CNN repeatedly for feature extraction, which greatly shortens training time: each image needs only one forward pass.
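A schematic sketch (with hypothetical stand-in functions, not the authors' code) of where the redundancy lies: R-CNN runs the expensive convolutional forward pass once per candidate region, while SPP-net runs it once per image and then pools each region's feature from the shared feature map.

```python
def cnn_forward(image_or_crop):
    """Stand-in for an expensive convolutional forward pass."""
    cnn_forward.calls += 1
    return "feature_map"
cnn_forward.calls = 0

def spp_pool(feature_map, region):
    """Stand-in for cropping a region on the feature map and SPP-pooling it."""
    return "fixed_length_vector"

regions = [f"region_{i}" for i in range(2000)]      # ~2000 selective-search proposals

# R-CNN style: one forward pass per warped region crop
cnn_forward.calls = 0
rcnn_features = [cnn_forward(r) for r in regions]
print("R-CNN forward passes:", cnn_forward.calls)    # 2000

# SPP-net style: one forward pass for the whole image, then pool each region
cnn_forward.calls = 0
feature_map = cnn_forward("whole_image")
spp_features = [spp_pool(feature_map, r) for r in regions]
print("SPP-net forward passes:", cnn_forward.calls)  # 1
```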

Let's now explain these two improvements in detail:

1. The SPP layer (spatial pyramid pooling)

The first thing to be clear about is the position of this layer: it is inserted between the last convolution layer and the fully connected layers, so that it outputs fixed-length features to the fully connected layers, which require a fixed-size input.

The structure of SPP layer is shown in the following figure.

Input of SPP layer:

As shown in the gray box below.

The feature output by the previous convolution layer (we call it the feature map) is shown as the black part of the figure below; the input of the SPP layer is the region on this feature map that corresponds to a candidate region.

The sentence above may sound a bit convoluted. It can be understood like this: an image has about 2000 candidate regions, and convolving the image once yields one feature map; on this feature map there are likewise about 2000 regions corresponding to those candidate regions (the correspondence is explained in detail below).

Output of SPP layer:

The SPP layer uses a three-level pooling pyramid (1x1, 2x2, and 4x4 bins). Each input (each of which has a different size) is max pooled at every level (max pooling is what the paper uses), and the pooled features are then concatenated, giving 16+4+1 = 21 bins.

Regardless of the size of the input, the feature is thus fixed at (16+4+1)x256 dimensions, 256 being the number of filters in the last convolution layer. In this way, the output of the SPP layer is always a (16+4+1)x256-dimensional feature vector, no matter what size the candidate region in the image is.
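A minimal sketch of such an SPP layer in PyTorch (my own illustration; AdaptiveMaxPool2d is just a convenient way to get the fixed bin counts, whereas the paper computes each level's window size and stride from the feature-map size):

```python
import torch
import torch.nn as nn

class SPPLayer(nn.Module):
    """Pool an arbitrary-sized feature map into 4x4 + 2x2 + 1x1 = 21 bins per channel."""
    def __init__(self, levels=(4, 2, 1)):
        super().__init__()
        self.pools = nn.ModuleList([nn.AdaptiveMaxPool2d(n) for n in levels])

    def forward(self, fmap):                   # fmap: (N, C, H, W), any H and W
        parts = [pool(fmap).flatten(1) for pool in self.pools]
        return torch.cat(parts, dim=1)         # (N, 21 * C)

spp = SPPLayer()
for h, w in ((13, 13), (9, 17)):               # feature maps of different sizes
    fmap = torch.randn(1, 256, h, w)
    print((h, w), "->", tuple(spp(fmap).shape))  # always (1, 21 * 256) = (1, 5376)
```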

2. The mapping between a candidate region on the original image and its corresponding region on the feature map

This calculation is essentially a computation of receptive-field sizes.

In a CNN, the receptive field of an element in some layer's output refers to the region of the upper (previous) layer that this element corresponds to, as shown in the following figure.

First, define several parameters. The parameter definitions follow Andrew Ng's notation from his Coursera course; we then explain how to do the calculation.

The input size has the following relationship with the output size:
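Written out explicitly, this is the standard convolution/pooling size formula; in Ng's notation, with input size $n_{\text{in}}$, filter size $f$, padding $p$, and stride $s$:

$$n_{\text{out}} = \left\lfloor \frac{n_{\text{in}} + 2p - f}{s} \right\rfloor + 1$$

For example, a 224x224 input passed through a 7x7 filter with stride 2 and padding 3 gives $\lfloor(224 + 6 - 7)/2\rfloor + 1 = 112$.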

The above is the correspondence between region sizes. Now let's look at the correspondence between coordinate points.

SPP-net simplifies the above coordinate correspondence, and the simplification process is as follows:

Since these are coordinate values, they cannot take decimals, so the padding term can essentially be ignored by setting each layer's padding to ⌊k/2⌋ (k being the filter size). The formula then simplifies so that the coordinate of the receptive-field center depends only on the coordinate in the upper layer and the stride.
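Written out (a reconstruction following the appendix of the SPP-net paper), for a layer with filter size $k$, stride $s$, and padding $\text{pad}$, a point $p_{\text{out}}$ on its output has its receptive-field center on the input at

$$p_{\text{in}} = s \cdot p_{\text{out}} + \left(\frac{k-1}{2} - \text{pad}\right).$$

If every layer pads by $\lfloor k/2 \rfloor$, the bracketed term vanishes for odd $k$ (and is $-1/2$, negligible for integer coordinates, for even $k$), so $p_{\text{in}} = s \cdot p_{\text{out}}$. Composing this over all layers, a point $p$ on the feature map corresponds to roughly $S \cdot p$ in the original image, where $S$ is the product of all the strides.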

Then, the method shown below is the mapping method of SPP-net: the upper-left and lower-right corners of the ROI in the original image are mapped to the two corresponding points on the feature map. With those two corner points, the corresponding feature-map region (orange in the figure below) is determined.

The mapping from coordinates in the original image to coordinates on the feature map is as follows:
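As given in the appendix of the SPP-net paper (assuming every layer pads by $\lfloor k/2 \rfloor$; otherwise an extra offset is needed), with $S$ the product of all strides before the feature map, a corner at coordinate $x$ in the original image maps to $x'$ on the feature map as:

left/top corner: $x' = \lfloor x / S \rfloor + 1$, and right/bottom corner: $x' = \lceil x / S \rceil - 1$,

so that the projected pixel on the feature map has its receptive-field center as close as possible to the original corner.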

The core idea of SPP-net ends here. The remainder of the SPP-net pipeline is similar to R-CNN; please refer to the R-CNN article for a detailed explanation. The paper also applies these ideas to image classification experiments; if you are interested, you can read the classification part of the original paper carefully.
