Faster R-CNN: Real-Time Object Detection with RPN
Paper: Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks.

Most object detection networks rely on a region proposal algorithm to hypothesize object locations. R-CNN uses the selective search algorithm to propose likely ROIs (regions of interest), and then classifies each extracted region with a standard CNN. Selective search places about 2000 candidate regions of different shapes, sizes and positions around potential objects, chosen so that the probability of an object falling inside one of them is high. Each of these regions is then run through the network to find the object, even though most of them turn out to be useless. This is still far more efficient than exhaustively evaluating nearly every possible region.

Fast R-CNN does not generate candidate regions on the original image. Instead, it first obtains a feature map of the whole image through the convolutional network, then uses the candidate-region algorithm to map the regions of interest onto that feature map, and finally uses RoI Pooling to bring all regions to the same size. This greatly reduces the running time of the detection network, but the computation of region proposals becomes the bottleneck of the whole detector.

Faster R-CNN introduces the Region Proposal Network (RPN), which shares the input image's convolutional features with the detection network, making region generation nearly cost-free. The RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. After end-to-end training, the RPN generates high-quality region proposals, which are then fed to Fast R-CNN for detection.

Faster R-CNN consists of two modules: the first is a deep fully convolutional network that proposes regions, and the second is the Fast R-CNN detector that uses those proposals. The whole system is a single, unified object detection network. In terms of the recently popular neural-network notion of "attention", the RPN module tells the Fast R-CNN module where to look for objects.

For an input image, the detector needs to produce the following outputs: a list of bounding boxes, a class label for each box, and a probability (score) for each box and label.

The first step of Faster R-CNN is to use a CNN pre-trained on a classification task (such as ImageNet) as the feature extractor. The input image is represented as an H × W × D tensor, and the pre-trained CNN produces a conv feature map.

Faster R-CNN originally used ZF and VGG networks pre-trained on ImageNet, and many other networks with different trade-offs have appeared since. For example, MobileNet is a small, efficient architecture with only 3.3M parameters, while ResNet-152 reaches 60M parameters; newer architectures such as DenseNet both improve results and reduce the number of parameters.

Take VGG16 as an example:

When VGG16 classifies an image, the input is a 224×224×3 tensor (a 224×224-pixel RGB image). At the end, the network uses FC layers (not conv layers) to produce a fixed-length vector for classification: the output of the last conv layer is flattened into a 1-D vector and fed to the FC layers. For Faster R-CNN, the official implementation uses the output of conv5/conv5_1.
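
As a concrete illustration, here is a minimal sketch of extracting such a conv feature map with VGG16. It assumes PyTorch/torchvision, which the article itself does not use (the paper's original code is in Caffe):

```python
import torch
import torchvision

# Load an ImageNet-pretrained VGG16 and keep only the convolutional layers,
# dropping the final max-pool so the backbone has stride r = 16.
vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1")
backbone = vgg.features[:-1]

image = torch.randn(1, 3, 600, 800)  # a dummy 600x800 RGB image
with torch.no_grad():
    feature_map = backbone(image)
print(feature_map.shape)  # torch.Size([1, 512, 37, 50]), i.e. about H/16 x W/16
```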

Going deeper, the conv feature map encodes all of the image's information while preserving the locations of the "things" it encodes relative to the original image. For example, if there is a red square in the upper-left corner of the image and the conv layers activate on it, then the information about that red square is still in the upper-left corner of the conv feature map after encoding. It is therefore feasible to locate objects using the feature map.

The ResNet architecture has gradually replaced VGG as the base network for feature extraction. ResNet's clear advantage over VGG is that it is deeper, so its learning capacity is larger. This is important for classification, and it should be just as true for object detection. In addition, ResNet's residual connections and BN (batch normalization) make deep models easier to train.

Next, the RPN (Region Proposal Network) processes the extracted conv feature map to find a predetermined number of regions (bounding boxes) that may contain objects. To generate candidate regions, a 3×3 convolution with 512 kernels, followed by ReLU, is applied to the conv feature map output by the last conv layer, so that each 3×3 region yields a 512-dimensional feature vector. That feature vector is then fed into two sibling fully connected layers: a bounding-box regression layer (reg) and a bounding-box classification layer (cls).
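
A minimal PyTorch sketch of this head follows; the sibling reg/cls layers are written here as 1×1 convolutions, the usual fully convolutional form of the sliding-window computation (k, the number of anchors per position, is explained below):

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Sketch of the RPN head: a 3x3 conv with 512 kernels + ReLU, then two
    sibling 1x1 convs standing in for the reg and cls layers."""
    def __init__(self, in_channels=512, k=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(512, 2 * k, kernel_size=1)  # object/background scores
        self.reg = nn.Conv2d(512, 4 * k, kernel_size=1)  # box offsets

    def forward(self, feature_map):
        x = torch.relu(self.conv(feature_map))
        return self.cls(x), self.reg(x)

scores, deltas = RPNHead()(torch.randn(1, 512, 37, 50))
print(scores.shape, deltas.shape)  # (1, 18, 37, 50) and (1, 36, 37, 50)
```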

Let's explain the meanings of k, 2k and 4k.

In deep-learning-based object detection, the hardest problem is generating a variable-length list of bounding boxes, which are rectangles of different sizes and aspect ratios. A deep neural network normally produces an output tensor of fixed size (RNNs aside). For example, in image classification the network outputs a tensor of shape (C,), where C is the number of class labels, and the scalar at each position is the probability that the image belongs to that class.

In the RPN, anchors solve the problem of the variable-length bounding-box list: reference bounding boxes of fixed sizes are placed uniformly over the original image. As described above, the RPN applies a 3×3 convolution to the feature map; suppose k candidate regions are to be predicted at each position. The reg layer then has 4k outputs encoding the coordinates of the k boxes, and the cls layer outputs 2k scores estimating the probability that each region is object or background. The k regions are initialized from k reference bounding boxes, the anchors, which serve as the first guesses of object locations. Each anchor is centered at the center of the convolution's sliding window. By default, each sliding position uses 3 scales (128², 256², 512²) and 3 aspect ratios (1:2, 1:1, 2:1), giving k = 9. For a conv feature map of size W×H (typically around 2400 positions), there are W×H×k anchors in total. For the two final layers of the RPN, the number of parameters is 512×(4+2)×k.
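
A small NumPy sketch of generating the k = 9 anchors at one sliding position; the exact parameterization here (keeping the area at scale² while setting the aspect ratio) is an assumption of this sketch:

```python
import numpy as np

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate the k = 3x3 = 9 anchors centered at (cx, cy), in
    original-image coordinates, as (x1, y1, x2, y2)."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s * np.sqrt(r)  # keep area ~ s^2 while setting w/h = r
            h = s / np.sqrt(r)
            anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)

# One anchor set per feature-map position; position (i, j) maps back to
# pixel (j * r, i * r), with r = 16 for VGG.
print(make_anchors(cx=10 * 16, cy=8 * 16).shape)  # (9, 4)
```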

Instead of directly detecting where the object is, the problem is reframed into two parts. For each anchor: does this anchor contain a relevant object, and how should the anchor be adjusted to better fit that object?

There is a simple way to predict the object's bounding box: learn offsets relative to a reference box. Given a reference box (x_center, y_center, width, height), the network predicts offsets (Δx_center, Δy_center, Δwidth, Δheight), which are generally small values that adjust the reference box for a better fit.
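
The paper parameterizes these offsets by normalizing the center shift by the anchor size and taking the log of the size ratios. A small NumPy sketch of that encoding:

```python
import numpy as np

def encode_deltas(anchor, gt):
    """Offsets (t_x, t_y, t_w, t_h) of a ground-truth box relative to a
    reference (anchor) box, both given as (x_center, y_center, width, height)."""
    xa, ya, wa, ha = anchor
    x, y, w, h = gt
    return np.array([(x - xa) / wa,    # t_x
                     (y - ya) / ha,    # t_y
                     np.log(w / wa),   # t_w
                     np.log(h / ha)])  # t_h

print(encode_deltas((100, 100, 128, 128), (110, 96, 120, 150)))
```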

Although anchors are defined on the conv feature map, the final anchors are expressed relative to the original image.

Because the network contains only conv and pooling layers, the feature map's dimensions are proportional to those of the original image. Mathematically, if the image size is w×h, the feature map size is w/r×h/r, where r is the subsampling rate. Defining one anchor per spatial position of the conv feature map therefore produces, on the original image, a set of anchors spaced r pixels apart. For VGG, r = 16.

The RPN takes all the reference anchors and outputs a set of good proposals for objects. For each anchor, there are two different outputs:

The RPN is a fully convolutional network.

For the classification layer, each anchor outputs two predicted values: the score of the anchor being background (not an object) and the score of it being foreground (an actual object).

For the regression layer, which can also be called the bounding-box adjustment layer, each anchor outputs four predicted values: (Δx_center, Δy_center, Δwidth, Δheight). These are applied to the anchor to obtain the final proposals. Combining the final proposal coordinates with their objectness scores yields a good set of object proposals.

So the RPN outputs two kinds of predictions: a binary classification and a bounding-box regression adjustment.

To train the RPN, each anchor is assigned a binary class label (object or not object). Positive labels are assigned to two kinds of anchors: (i) the anchor(s) with the highest IoU overlap with a ground-truth box, or (ii) any anchor with IoU overlap above 0.7 with some ground-truth box. Note that a single ground-truth box can assign positive labels to multiple anchors. Usually the second condition is sufficient to determine the positive samples, but the first is kept because in some rare cases the second condition finds no positive sample at all. A negative label is assigned to any non-positive anchor whose IoU is lower than 0.3 for all ground-truth boxes. Anchors that are neither positive nor negative do not contribute to the training objective.
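
A minimal sketch of this labeling rule, assuming `iou` is a precomputed (num_anchors × num_ground_truth) matrix of overlaps:

```python
import numpy as np

def label_anchors(iou, pos_thresh=0.7, neg_thresh=0.3):
    """Assign 1 (object), 0 (background) or -1 (ignored) to each anchor."""
    labels = np.full(iou.shape[0], -1)
    max_iou = iou.max(axis=1)            # best overlap of each anchor
    labels[max_iou < neg_thresh] = 0     # negative: below 0.3 for all GT boxes
    labels[max_iou > pos_thresh] = 1     # rule (ii): overlap above 0.7
    labels[iou.argmax(axis=0)] = 1       # rule (i): best anchor per GT box
    return labels

iou = np.array([[0.8, 0.1], [0.2, 0.4], [0.1, 0.05]])
print(label_anchors(iou))  # [1 1 0] -- anchor 1 is positive only via rule (i)
```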

The anchors are then randomly sampled to form a mini-batch of size 256, keeping the ratio of foreground to background anchors as balanced as possible.

The RPN uses binary cross-entropy to compute the classification loss over all anchors in the mini-batch. The regression loss is computed only for the anchors in the mini-batch labeled as foreground: for each such anchor, the offset Δ that transforms the anchor into its nearest ground-truth object is computed and used as the regression target.

For the regression error, Faster R-CNN uses neither a plain L1 nor an L2 loss, but the smooth L1 loss. Smooth L1 is essentially the same as L1, except that when the absolute error is small enough the value is considered almost correct and the loss falls off at a faster (quadratic) rate.
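
In the form used by Fast/Faster R-CNN (with the switching threshold fixed at 1), the loss is quadratic near zero and linear for large errors:

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1: 0.5 * x^2 if |x| < 1, else |x| - 0.5."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * ax ** 2, ax - 0.5)

print(smooth_l1(np.array([0.2, 2.0])))  # [0.02 1.5]
```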

Because the anchors generally overlap, the proposals generated for the same object also overlap.

To resolve the overlapping proposals, the NMS (non-maximum suppression) algorithm is applied: proposals are sorted by score, the highest-scoring ones are kept, and any proposal whose IoU with an already-kept, higher-scoring proposal exceeds a preset threshold is discarded.

Although NMS looks simple, the IoU threshold must be chosen with care. If it is too low, proposals for some objects may be lost; if it is too high, many proposals may remain for the same object. A typical IoU threshold is 0.7.
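
A minimal NumPy sketch of the greedy NMS procedure described above:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.7):
    """Greedy NMS. boxes: (N, 4) as (x1, y1, x2, y2); scores: (N,)."""
    order = scores.argsort()[::-1]  # indices, highest score first
    keep = []
    while order.size > 0:
        i = order[0]                # best remaining proposal
        keep.append(i)
        rest = order[1:]
        # IoU of box i with every remaining box
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]  # drop heavily overlapping proposals
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]], float)
print(nms(boxes, np.array([0.9, 0.8, 0.7])))  # [0, 2]: box 1 is suppressed
```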

After NMS, the top-N proposals, sorted by score, are kept. The Faster R-CNN paper uses N = 2000, but N can be much smaller, e.g. 50, and still yield good results.

Once we have the likely objects and their corresponding locations in the original image, the problem becomes more straightforward: using the features already extracted by the CNN and the bounding boxes of the relevant objects, RoI Pooling extracts a fixed-size feature vector for each object.

After RPN processing, we have a set of object proposals with no class scores assigned. The remaining problem is how to take these bounding boxes and classify them.

The simplest approach would be to crop each proposal out of the image and send it through the pre-trained base network for feature extraction, then train a classifier on those features. But this requires running the computation for all 2000 proposals, which is inefficient and slow. Faster R-CNN speeds this up by reusing the conv feature map: RoI (Region of Interest) Pooling extracts a fixed-size feature map for each proposal, and R-CNN then classifies these fixed-size feature maps.

In object detection, including Faster R-CNN, a simpler method is commonly used: each proposal is used to crop the conv feature map, and an interpolation algorithm (usually bilinear interpolation) resizes each crop to a fixed 14×14×ConvDepth. Max pooling with a 2×2 kernel then produces the final 7×7×ConvDepth feature map for each proposal.
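
For illustration, torchvision ships a closely related crop-and-resize operator, roi_align (RoI Align is a later refinement of the operation described above); a minimal sketch:

```python
import torch
import torchvision.ops as ops

feature_map = torch.randn(1, 512, 37, 50)  # backbone output, stride r = 16
# One proposal: (batch_index, x1, y1, x2, y2) in original-image coordinates.
proposals = torch.tensor([[0, 48.0, 64.0, 320.0, 256.0]])
pooled = ops.roi_align(feature_map, proposals, output_size=(7, 7),
                       spatial_scale=1.0 / 16)  # map image coords onto the feature map
print(pooled.shape)  # torch.Size([1, 512, 7, 7])
```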

This exact shape is chosen to fit the module that uses it next (R-CNN).

R-CNN classifies the features extracted by RoI Pooling, using fully connected layers to output a classification score for each possible object class. This is the last step of the Faster R-CNN pipeline.

R-CNN has two different outputs:

R-CNN flattens the feature map of each proposal and processes it with two fully connected layers of size 4096, each followed by ReLU. It then uses two separate fully connected layers for the two outputs: one with N+1 neurons, where N is the total number of classes and the extra one is the background class; and one with 4N neurons, producing the regression predictions (Δcenter_x, Δcenter_y, Δwidth, Δheight), one set for each of the N possible classes.
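
A minimal PyTorch sketch of this head (the input size assumes the 7×7×512 RoI features from above; num_classes=20 is an illustrative choice):

```python
import torch
import torch.nn as nn

class RCNNHead(nn.Module):
    """Sketch of the R-CNN head: two 4096-d FC layers with ReLU, then one FC
    output with N+1 class scores (N classes + background) and one with 4N
    per-class box offsets."""
    def __init__(self, in_features=512 * 7 * 7, num_classes=20):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_features, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True),
        )
        self.cls = nn.Linear(4096, num_classes + 1)
        self.reg = nn.Linear(4096, 4 * num_classes)

    def forward(self, pooled_rois):  # (num_rois, 512, 7, 7)
        x = self.fc(pooled_rois)
        return self.cls(x), self.reg(x)

scores, deltas = RCNNHead()(torch.randn(3, 512, 7, 7))
print(scores.shape, deltas.shape)  # (3, 21) and (3, 80)
```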

The computation of R-CNN training targets is essentially the same as for RPN targets, but taking the different possible object classes into account.

For each proposal and ground-truth box, the IoU is computed. Any proposal with IoU greater than 0.5 with some ground-truth box is assigned to that box and labeled with its class. Proposals with IoU between 0.1 and 0.5 are labeled as background, and proposals with no overlap at all are ignored, because at this stage it is assumed that good proposals have already been obtained. Of course, all these hyperparameters can be tuned to better fit the objects at hand.

The target for bounding-box regression is the offset between a proposal and its corresponding ground-truth box, computed only for the proposals that were assigned a class based on the IoU threshold. A balanced mini-batch of 64 proposals is randomly sampled, with 25% foreground proposals (with a class) and 75% background.

Similar to the RPN losses, the classification loss for the selected proposals is the multi-class cross-entropy loss, and for the 25% foreground proposals the smooth L1 loss is used to measure the match with the ground-truth box.

Since the R-CNN fully connected network outputs one box-regression prediction per class, care is needed when computing the regression loss: only the prediction for the correct class is used.

Similar to the RPN, R-CNN finally outputs a set of classified objects, which are post-processed further before the results are returned.

To apply the bounding-box adjustments, we take, for each proposal, the class with the highest probability, and ignore proposals whose highest-probability class is background.

When computing the final objects, background predictions are ignored and class-based NMS is applied: objects are grouped by class, sorted by probability, NMS is run on each group independently, and the results are then merged.

You can still limit the final list of objects by setting a probability threshold.

In the paper, Faster R-CNN is trained with a multi-step scheme that trains the weights of each module separately and then combines them. It was later found that end-to-end joint training gives better results.

When the full model is trained jointly, there are four different losses: two for the RPN and two for R-CNN. They are combined as a weighted sum: the classification and regression losses can be weighted as needed, and the R-CNN and RPN losses can be given different weights.
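
A minimal sketch of this weighted sum; the weight names and default values are illustrative placeholders, not the paper's:

```python
import torch

def total_loss(rpn_cls, rpn_reg, rcnn_cls, rcnn_reg,
               w_rpn=1.0, w_rcnn=1.0, w_reg=1.0):
    # Weighted sum of the four losses; each weight is a tunable hyperparameter.
    return w_rpn * (rpn_cls + w_reg * rpn_reg) + w_rcnn * (rcnn_cls + w_reg * rcnn_reg)

print(total_loss(*(torch.tensor(v) for v in (0.3, 0.1, 0.5, 0.2))))  # tensor(1.1000)
```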

Training uses SGD with momentum 0.9. The initial learning rate is 0.001, decayed to 0.0001 after 50K iterations. This is a commonly used set of parameter settings.
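
Expressed with PyTorch's optimizer API (an illustrative sketch; the model here is a placeholder standing in for the detector):

```python
import torch

model = torch.nn.Linear(8, 4)  # placeholder module standing in for the detector
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
# Drop the learning rate from 0.001 to 0.0001 (factor 0.1) after 50K iterations.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50_000, gamma=0.1)
```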