First, we should understand why a large size difference among detected objects reduces model accuracy. The backbone network of essentially every object detection model extracts deep image information step by step through multi-layer convolution, producing multi-level feature maps, on which further processing such as localization and classification is performed.
In this "shallow to deep" feature extraction process, shallow features have high resolution and carry rich geometric detail, but their receptive field is small and they lack semantic information. Deep features, on the contrary, have a large receptive field and rich semantic information, but their resolution is low, so they struggle to carry geometric detail. If we keep deepening the model, the ultra-deep features will have an enormous receptive field, and the semantic information of the detected object will be diluted and weakened by information from the surrounding environment.
What happens if the training data contains both extremely large and extremely small objects?
As shown in the figure above, suppose the model has 100 layers. The detail information of both large and small targets decreases as the layers deepen. As for semantic information: because small targets are small in scale, their semantic information may already be fully extracted by, say, the 25th layer (after several rounds of downsampling); as the layers increase further, that semantic information is rapidly diluted by environmental information. Large targets, however, are large in scale and may need around 50 layers to extract enough semantic information, by which point the semantic information of small targets has almost been lost.
So should the depth of this network be set to 25, 50, or 37 layers? With 25 layers, the model detects small targets well but large targets poorly; with 50 layers, the opposite; at 37 layers, detection of the two kinds of targets is relatively balanced, but neither is at its best. This is the root of the multi-scale object detection problem.
The following are several common strategies for mitigating the accuracy loss caused by a large range of scale variation.
An image pyramid is the collection of multiple resolutions of the same image, ordered from large to small. It can be obtained by repeatedly downsampling the image until some termination condition is reached. This process looks simple, but it effectively describes the image from a multi-resolution perspective: the bottom of the pyramid is a high-resolution representation of the image to be processed, while the top is a low-resolution approximation of it.
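As a minimal sketch of this process, the following uses OpenCV's pyrDown, which blurs and halves the image at each step (the stopping threshold min_side is an illustrative choice, not a standard value):

```python
import cv2

def build_image_pyramid(image, min_side=64):
    """Repeatedly halve the image with Gaussian downsampling until
    the shorter side would drop below min_side."""
    pyramid = [image]
    while min(pyramid[-1].shape[:2]) >= 2 * min_side:
        pyramid.append(cv2.pyrDown(pyramid[-1]))
    return pyramid  # pyramid[0] = high-res base, pyramid[-1] = low-res top
```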
Using an image pyramid to address "large scale variation of detected objects degrades model accuracy" works as follows: once a picture is expanded into an image pyramid, a single detected object appears at a range of scales, from large to small, across the pyramid levels. When these images of different scales are fed into the model, then even if the model is only good at recognizing objects within a certain scale range, any detected object, however large or small, will be scaled into that range at some level of the pyramid. Please think about the advantages and disadvantages of the "image pyramid" method.
Although this approach handles large scale variation through multi-scale feature extraction, it greatly increases memory usage and makes training complex networks difficult; it also substantially raises the model's computational cost, leading to longer inference times.
A pyramidal feature hierarchy can be regarded as a feature-level image pyramid. Generally, a model's shallow feature maps are large, and stride-2 convolutions or pooling layers gradually reduce the feature map size. Both the larger feature maps in front and the smaller feature maps behind can be used for detection.
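A toy PyTorch sketch of such a hierarchy (the layer widths are illustrative and do not correspond to any particular detector):

```python
import torch
import torch.nn as nn

class TinyHierarchy(nn.Module):
    """Each stride-2 convolution halves the feature map, so a pyramidal
    feature hierarchy falls out of the forward pass for free."""
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)
        self.stage2 = nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1)
        self.stage3 = nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1)

    def forward(self, x):
        f1 = self.stage1(x)   # large, shallow map -> better for small objects
        f2 = self.stage2(f1)  # medium
        f3 = self.stage3(f2)  # small, deep map -> better for large objects
        return [f1, f2, f3]   # every level can feed a detection head
```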
The Single Shot MultiBox Detector (SSD) was an early attempt to use a pyramidal feature hierarchy. The feature maps in SSD's "extra feature layers" are downsampled repeatedly to form four feature maps of different scales, and each of these maps, already produced during forward propagation, is reused to make its own predictions; the pyramidal feature hierarchy therefore adds essentially zero computational cost. At the same time, however, SSD does not reuse the feature maps in its VGG base; instead it builds the pyramid from new layers added after the top of the network, thereby missing the opportunity to reuse the higher-resolution feature maps of earlier layers, and these higher-resolution feature maps are very important for detecting small targets.
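A sketch of attaching one prediction head per pyramid level, in the SSD style (the channel counts match the toy hierarchy above, and the anchor and class numbers are illustrative):

```python
import torch
import torch.nn as nn

# Each head predicts (class scores + 4 box offsets) per anchor at every
# spatial position of its feature map.
num_classes, num_anchors = 21, 4
heads = nn.ModuleList(
    nn.Conv2d(c, num_anchors * (num_classes + 4), kernel_size=3, padding=1)
    for c in (16, 32, 64)
)

# Dummy multi-scale feature maps standing in for a backbone's outputs.
features = [torch.randn(1, c, s, s) for c, s in ((16, 112), (32, 56), (64, 28))]
predictions = [head(f) for head, f in zip(heads, features)]
```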
FPN stands for Feature Pyramid Network; its overall structure is shown in the figure above. FPN is a network structure that exploits the inherent multi-scale feature maps of a deep convolutional neural network, constructing a feature pyramid that carries high-level semantic information at every scale, at minimal extra computational cost, by adding lateral connections and upsampling.
In an object detection model, the FPN structure is not an independent module; it is integrated into the convolutional neural network as an addition to the original backbone. The FPN structure consists of two main parts: the bottom-up pathway, and the top-down pathway with lateral connections. Here, we will use ResNet as the backbone to explain how the FPN structure works.
The bottom-up pathway is the structure on the left side of the FPN diagram, and it is simply the standard ResNet backbone network. First, review the ResNet network structure:
With its original input size of 224x224, the ResNet-18 network passes through convolution or pooling layers with stride=2 that scale the feature map step by step to 112x112, 56x56, 28x28, 14x14, and 7x7.
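A sketch of collecting the stage outputs C2..C5 from such a backbone (attribute names follow torchvision's resnet18; this assumes torchvision is available):

```python
import torch
from torchvision.models import resnet18

def bottom_up_features(x):
    """Run a ResNet-18 forward pass and keep each stage's output."""
    net = resnet18(weights=None)
    x = net.maxpool(net.relu(net.bn1(net.conv1(x))))  # stem: total stride 4
    c2 = net.layer1(x)   # 64 ch,  56x56 for a 224x224 input
    c3 = net.layer2(c2)  # 128 ch, 28x28
    c4 = net.layer3(c3)  # 256 ch, 14x14
    c5 = net.layer4(c4)  # 512 ch, 7x7
    return c2, c3, c4, c5

c2, c3, c4, c5 = bottom_up_features(torch.randn(1, 3, 224, 224))
```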
The top-down pathway and lateral connections form the structure on the right side of the FPN diagram. The operating rules of this part can be expressed as follows:
Operation rule 1: take the top output C5 (size=7x7) of the left-hand bottom-up pathway through a lateral connection, and adjust its channel number with a 1x1 convolution (the FPN paper sets this number to 256, to match the Faster R-CNN computation that follows). The result is the top level of the top-down pathway, denoted M5 (size=7x7).
Operation rule 2: take the left-hand output C4 (size=14x14) through a lateral connection, upsample M5 by a factor of 2 with nearest-neighbor interpolation (to size=14x14), and add C4 to the upsampled result. The result is denoted M4 (size=14x14).
By analogy, M3 (size=28x28) and M2 (size=56x56) can then be computed. (M1 exists in theory, but C1 is obtained by convolving the original image only once and carries almost no semantic information, so it is generally not computed.) After these operations, the structure diagram becomes:
The results of the steps above can be labeled {M2, M3, M4, M5} in turn. But this is not the final output: the aliasing effect produced during upsampling would disturb subsequent predictions, so every feature map obtained through upsampling is passed through a 3×3 convolution to remove the aliasing, giving the final outputs {P2, P3, P4, P5}. Since M5 is not obtained by upsampling, only {M2, M3, M4} need this convolution.
The figure above shows the whole calculation flow of the FPN structure.
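To make the flow concrete, here is a minimal PyTorch sketch of the structure just described, assuming ResNet-18 channel widths for C2..C5 (64/128/256/512). Following the text, only M2..M4 receive the 3x3 smoothing convolution, so P5 = M5:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Top-down pathway with lateral connections, per the rules above."""
    def __init__(self, in_channels=(64, 128, 256, 512), out_channels=256):
        super().__init__()
        # Lateral connections: 1x1 convs unify every Ci to out_channels.
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels
        )
        # 3x3 convs that smooth the upsampled maps M2..M4 into P2..P4.
        self.smooth = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
            for _ in range(3)
        )

    def forward(self, c2, c3, c4, c5):
        up = lambda m: F.interpolate(m, scale_factor=2, mode="nearest")
        m5 = self.lateral[3](c5)            # rule 1: top of the pyramid
        m4 = self.lateral[2](c4) + up(m5)   # rule 2: lateral + 2x upsample
        m3 = self.lateral[1](c3) + up(m4)
        m2 = self.lateral[0](c2) + up(m3)
        p2 = self.smooth[0](m2)             # 3x3 convs remove aliasing
        p3 = self.smooth[1](m3)
        p4 = self.smooth[2](m4)
        return p2, p3, p4, m5               # {P2, P3, P4, P5}, with P5 = M5

fpn = SimpleFPN()
c2, c3, c4, c5 = (torch.randn(1, c, s, s)
                  for c, s in ((64, 56), (128, 28), (256, 14), (512, 7)))
p2, p3, p4, p5 = fpn(c2, c3, c4, c5)  # 56x56, 28x28, 14x14, 7x7, all 256-ch
```

Note that the 1x1 lateral convolutions are what allow feature maps with different channel counts to be summed, and nearest-neighbor upsampling keeps the extra computation minimal.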
Data augmentation is the simplest and most effective way to improve small-target detection performance. Different augmentation strategies enlarge the training dataset and enrich its diversity, thereby strengthening the robustness and generalization ability of the detection model. Common data augmentation methods are shown in the figure below:
Although data augmentation alleviates, to a certain extent, small targets' lack of information, appearance features, and texture, effectively improving the network's generalization ability and yielding good final detection performance, it also increases the computational cost. Moreover, in practice the augmentation often needs to be tailored to the target characteristics: an ill-chosen strategy may introduce new noise and harm feature extraction, which poses its own challenge for algorithm design.
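As a concrete illustration, here is a minimal sketch of a few common image-level augmentations using torchvision (the transform names are torchvision's; the parameter values are illustrative, not tuned). For detection, the geometric transforms would also have to be applied to the bounding boxes:

```python
import torchvision.transforms as T

augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),                               # geometric
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2), # photometric
    T.RandomResizedCrop(size=224, scale=(0.5, 1.0)),             # scale jitter
    T.ToTensor(),
])
```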
Generative adversarial learning methods aim to map the features of low-resolution small targets into features equivalent to those of high-resolution targets, so as to achieve the same detection performance as for larger targets. Although the data augmentation, feature fusion, and context learning discussed above can effectively improve small-target detection, the performance gains these methods bring are often limited by computational overhead. For example, Noh et al. proposed a feature-level super-resolution method that uses dilated convolutions to keep the receptive field of the generated high-resolution target features the same as that of the low-resolution features produced by the feature extractor, thus avoiding the erroneous super-resolution features that mismatched receptive fields would cause.
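To illustrate the receptive-field point, a one-line sketch of a dilated convolution (illustrative only; this is not Noh et al.'s actual architecture):

```python
import torch
import torch.nn as nn

# A 3x3 conv with dilation=2 covers a 5x5 region without downsampling,
# so stacking such layers enlarges the receptive field while the
# feature map keeps its spatial size.
dilated = nn.Conv2d(256, 256, kernel_size=3, padding=2, dilation=2)
out = dilated(torch.randn(1, 256, 14, 14))  # spatial size preserved: 14x14
```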
"FPN is a kind of network structure, which uses the inherent multi-scale feature mapping of deep convolutional neural network, and constructs feature pyramids with different scales of advanced semantic information with minimal additional computation by increasing horizontal connection and up-sampling."
1. FPN structure vs. a standard convolutional neural network with a single output feature map: a network that outputs only the last feature map easily loses the detail information of small targets.
2. FPN structure vs. the image pyramid: FPN uses the inherent multi-scale feature maps of the deep convolutional neural network and does not need to rescale the original image at multiple levels, greatly reducing computation.
3. FPN structure vs. the pyramidal feature hierarchy: the pyramid layers in SSD do not use the lower-level feature maps of the original backbone network, so some detail information is lost; and because the feature maps come directly from forward propagation, the semantic information of the deep feature maps cannot be combined with the shallow structural information. FPN, by contrast, both retains the detail information of the lower-level feature maps and fuses deep semantic information with shallow geometric detail through its "top-down" structure.
To sum up, through lateral connections FPN supplements the "bottom-up" data flow of a standard convolutional neural network with a "top-down" flow. This structure effectively enriches the semantic information carried by the bottom feature maps. Moreover, because the FPN pyramid is computed starting from Conv2 (C2), the bottom level keeps particularly rich geometric detail, especially the position information of small targets, which greatly helps improve small-target recall.
1. Replication augmentation: Kisantal M, Wojna Z, Murawski J, et al. Augmentation for small object detection [EB/OL]. (2019-02-19) [2019-02-19]. https://arxiv.org/abs/1902.07296.
2. FPN: Lin T Y, Dollár P, Girshick R, et al. Feature pyramid networks for object detection [C]//IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2017: 2117-2125.
8. Image pyramid: Adelson E H, Anderson C H, Bergen J R, et al. Pyramid methods in image processing [J]. RCA Engineer, 1984, 29(6): 33-41.
9. Zou Z, Shi Z, Guo Y, et al. Object detection in 20 years: a survey [EB/OL]. (2019-05-13) [2019-05-16]. https://arxiv.org/abs/1905.05055.