Do you have a paper on object detection?
Paper: EfficientDet: Scalable and Efficient Object Detection.

At present, in the field of object detection, high-accuracy models usually require large numbers of parameters and heavy computation, while lightweight networks generally sacrifice accuracy. This paper therefore aims to build a detection framework that is accurate, efficient, and scalable. Following the one-stage detection paradigm, it explores designs for the backbone network, feature fusion, and class/box prediction, and faces two main challenges:

FPN is currently the most widely used multi-scale fusion method, and cross-scale fusion variants such as PANet and NAS-FPN have appeared recently. To fuse different features, the original approach simply adds them together. However, since the input features have different resolutions, their contributions to the fused output should be unequal. To address this, the paper proposes a simple and efficient weighted bi-directional feature pyramid network (BiFPN), which uses learnable weights to capture the importance of each input feature while repeatedly applying top-down and bottom-up multi-scale fusion.

The paper argues that besides scaling the resolution of the backbone network and the input image, scaling the feature network and the box/class prediction network is also important for accuracy and efficiency. Building on EfficientNet, the authors propose a compound scaling method for detection networks that jointly scales the resolution/depth/width of the backbone, feature network, and box/class prediction networks.

Finally, taking EfficientNet as the backbone and combining BiFPN with compound scaling, the paper proposes a new detector family, EfficientDet, which is both accurate and lightweight. The COCO results are shown in Figure 1, and the contributions of the paper are as follows:

Multi-scale features are defined as a list P^{in} = (P^{in}_{l1}, P^{in}_{l2}, ...), and the goal is to find a transformation f that effectively fuses the different features and outputs new features P^{out} = f(P^{in}). Specifically, Figure 2a shows the conventional top-down FPN structure. A standard FPN performs only one such pass; it is written here in a repeatable form for comparison with BiFPN. FPN takes the level 3-7 features as input, where P^{in}_i denotes a feature level with resolution 1/2^i of the input image.

The top-down FPN operates as shown in the equations below, where Resize is usually an up-sampling or down-sampling operation that aligns resolutions, and Conv is a convolution operation for feature processing.
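Reconstructed from the original paper, the top-down FPN over levels 3-7 computes:

P_7^{out} = \mathrm{Conv}(P_7^{in})
P_6^{out} = \mathrm{Conv}(P_6^{in} + \mathrm{Resize}(P_7^{out}))
\vdots
P_3^{out} = \mathrm{Conv}(P_3^{in} + \mathrm{Resize}(P_4^{out}))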

The top-down FPN is limited by its one-way information flow. To address this, PANet (Figure 2b) adds an additional bottom-up fusion path, and NAS-FPN (Figure 2c) uses neural architecture search to find a better cross-scale feature network topology, although the search requires substantial resources. Among these, PANet achieves the highest accuracy but needs too many parameters and too much computation. To improve efficiency, this paper makes several improvements to the cross-scale connections:

Most feature fusion methods treat all input features equally, but the paper observes that inputs at different resolutions should contribute unequally to the fused output. To address this, the paper adds an extra learnable weight to each input feature during fusion, considering the following weighting schemes:

Here w_i is a learnable weight, which can be a per-feature scalar, a per-channel vector, or a multi-dimensional tensor. The paper finds that the scalar form is sufficient to improve accuracy without increasing computation. However, because a scalar weight is unbounded, it can easily cause training instability, so it should be normalized to a bounded range.

Softmax-based fusion normalizes all weights to [0, 1], but the softmax operation noticeably slows down GPU inference, which is analyzed in detail in the experiments.
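The softmax-based fusion referred to above, reconstructed from the paper, where I_i is the i-th input feature:

O = \sum_i \frac{e^{w_i}}{\sum_j e^{w_j}} \cdot I_i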

Fast normalized fusion instead applies ReLU to each weight to guarantee w_i >= 0, with a small epsilon for numerical stability. The normalized weight likewise falls in [0, 1], and because there is no softmax operation, it is more efficient, about 30% faster.
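The fast normalized fusion, reconstructed from the paper (epsilon = 0.0001 for numerical stability):

O = \sum_i \frac{w_i}{\epsilon + \sum_j w_j} \cdot I_i, \qquad w_i \ge 0 \text{ enforced by ReLU}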

BiFPN combines bidirectional cross-scale connections with fast normalized fusion. As an example, the level-6 fusion consists of an intermediate feature on the top-down path and an output feature on the bottom-up path; the other levels are constructed similarly. To further improve efficiency, feature fusion uses depthwise separable convolution, with batch normalization and an activation added after each convolution.
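The level-6 fusion, reconstructed from the paper (P_6^{td} is the intermediate top-down feature, P_6^{out} the bottom-up output):

P_6^{td} = \mathrm{Conv}\left(\frac{w_1 \cdot P_6^{in} + w_2 \cdot \mathrm{Resize}(P_7^{in})}{w_1 + w_2 + \epsilon}\right)

P_6^{out} = \mathrm{Conv}\left(\frac{w_1' \cdot P_6^{in} + w_2' \cdot P_6^{td} + w_3' \cdot \mathrm{Resize}(P_5^{out})}{w_1' + w_2' + w_3' + \epsilon}\right)

Below is a minimal PyTorch sketch of one such fusion node; the class name and layer arrangement are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FastNormalizedFusion(nn.Module):
    """One BiFPN fusion node: ReLU-constrained scalar weights followed by a
    depthwise separable convolution with batch norm, as described above."""

    def __init__(self, num_inputs: int, channels: int, eps: float = 1e-4):
        super().__init__()
        # One learnable scalar weight per input feature map.
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps
        # Depthwise separable conv = depthwise 3x3 + pointwise 1x1, then BN.
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels, bias=False),
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, inputs):
        # ReLU keeps each weight non-negative; eps avoids division by zero.
        w = F.relu(self.weights)
        w = w / (w.sum() + self.eps)
        fused = sum(wi * x for wi, x in zip(w, inputs))
        # Activation after conv+BN; SiLU (Swish) is the EfficientNet-family choice.
        return F.silu(self.conv(fused))

# Example: fuse two 64-channel feature maps already resized to the same shape.
node = FastNormalizedFusion(num_inputs=2, channels=64)
out = node([torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)])
```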

The structure of EfficientDet is shown in Figure 3. Following the one-stage detector paradigm, an ImageNet-pretrained EfficientNet serves as the backbone; BiFPN takes the backbone's level 3-7 features as input and repeatedly applies top-down and bottom-up bidirectional feature fusion; and the class and box prediction networks share weights across all feature levels.

Previous detection algorithms scale along a single dimension. Inspired by EfficientNet, the paper proposes a new compound scaling method for detection networks, which uses a single compound coefficient to jointly scale the width, depth, and resolution of the backbone, BiFPN, and class/box networks. Because there are too many scaling dimensions for the grid search used by EfficientNet to be practical, the paper adopts a heuristic scaling approach that scales all dimensions of the network simultaneously.

For the backbone, EfficientDet reuses the width and depth coefficients of EfficientNet-B0 through EfficientNet-B6.

The width (number of channels) of the BiFPN is scaled exponentially, while its depth (number of layers) grows linearly, because the depth must be rounded to a small integer.
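Equation 1, reconstructed from the paper, with phi the compound coefficient:

W_{bifpn} = 64 \cdot (1.35^{\phi}), \qquad D_{bifpn} = 3 + \phi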

The width of the box/class prediction networks is kept the same as that of the BiFPN, and their depth (number of layers) grows linearly according to Equation 2.
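Equation 2, reconstructed from the paper:

W_{pred} = W_{bifpn}, \qquad D_{box} = D_{class} = 3 + \lfloor \phi / 3 \rfloor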

Because BiFPN uses features from levels 3-7, the input image resolution must be divisible by 2^7 = 128, so Equation 3 increases the resolution linearly.
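Equation 3, reconstructed from the paper:

R_{input} = 512 + \phi \cdot 128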

Plugging different values of phi into Equations 1-3 yields EfficientDet-D0 (phi=0) through EfficientDet-D6 (phi=6); the specific parameters are listed in Table 1. EfficientDet-D7 does not follow the equations; instead it increases the input resolution on top of D6.
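To make the scaling concrete, here is a minimal Python sketch that evaluates Equations 1-3 for each compound coefficient; the function name is an illustrative assumption, and the paper's Table 1 further rounds the raw widths (e.g., 64 * 1.35 = 86.4 becomes 88 channels for D1):

```python
import math

def efficientdet_config(phi: int) -> dict:
    """Evaluate Equations 1-3 for compound coefficient phi.

    Raw values; the paper's Table 1 rounds the widths, and D7 is a
    special case that only enlarges the D6 input resolution.
    """
    return {
        "bifpn_width": 64 * (1.35 ** phi),        # Eq. 1: W_bifpn = 64 * 1.35^phi
        "bifpn_depth": 3 + phi,                   # Eq. 1: D_bifpn = 3 + phi
        "head_depth": 3 + math.floor(phi / 3),    # Eq. 2: D_box = D_class
        "input_resolution": 512 + phi * 128,      # Eq. 3: multiple of 128
    }

for phi in range(7):  # EfficientDet-D0 .. D6
    print(f"D{phi}:", efficientdet_config(phi))
```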

Models are trained with an SGD optimizer using momentum 0.9 and weight decay 4e-5. During the initial 5% warm-up phase the learning rate increases linearly from 0 to 0.008, then decays following a cosine schedule. Batch normalization is added after every convolution, with batch norm decay 0.997 and epsilon = 1e-4; an exponential moving average with decay 0.9998 is employed; focal loss is used together with multiple bbox anchor aspect ratios. Training runs on 32 GPUs with batch size 128, with settings differing slightly between D0-D4 and D5-D7.
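A minimal sketch of the learning-rate schedule described above (linear warm-up over the first 5% of steps, then cosine decay); the function name and the choice of decaying all the way to zero are assumptions:

```python
import math

def learning_rate(step: int, total_steps: int, base_lr: float = 0.008,
                  warmup_frac: float = 0.05) -> float:
    # Linear warm-up from 0 to base_lr over the first 5% of training steps.
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    # Cosine decay from base_lr toward 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```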

Table 2 compares EfficientDet with other algorithms; EfficientDet achieves higher accuracy with better efficiency. In the low-accuracy regime, EfficientDet-D0 matches the accuracy of YOLOv3 while using only 1/28 of the computation. Compared with RetinaNet and Mask R-CNN, it reaches the same accuracy with only 1/8 of the parameters and 1/25 of the computation. In the high-accuracy regime, EfficientDet-D7 reaches 51.0 mAP with 4x fewer parameters and 9.3x less computation than NAS-FPN, while using only 3x3 anchors instead of 9x9.

The paper also compares inference speed on real hardware. As shown in Figure 4, EfficientDet is 3.2x faster on GPU and 8.1x faster on CPU.

The paper isolates the contributions of the backbone and BiFPN through ablation experiments, and the results show that both are important. Note that the first model is RetinaNet-R50 (640) while the second and third models use 896 inputs, so part of the accuracy gain comes from the larger resolution. In addition, BiFPN makes the model considerably smaller, mainly by reducing channels: FPN uses 256 and 512 channels, while BiFPN uses only 160.

Table 4 compares the accuracy and complexity of the same network with the different cross-scale connections from Figure 2; BiFPN offers the best trade-off between accuracy and complexity.

Table 5 compares the two weighting methods across model sizes. The fast normalized fusion proposed in the paper improves speed by 26%-31% with little accuracy loss.

Figure 5 shows how the weights of the two methods evolve during training; the trajectory of fast normalized fusion closely tracks that of the softmax method. The weights also change rapidly during training, confirming that different features do contribute differently.

The paper compares the compound scaling method with single-dimension scaling. Although the methods perform similarly at first, the advantage of compound scaling becomes more and more pronounced as the model grows.

The paper proposes BiFPN, a lightweight cross-scale feature network, together with a compound scaling method customized for detection. Building on these optimizations, it introduces the EfficientDet family, which maintains both high accuracy and high efficiency, reaching SOTA. Overall, the ideas follow the earlier EfficientNet work and the innovation may not be as striking as before, but the experiments show the new detection framework is very practical, and hopefully the authors will open-source it.
