However, this sharing covers only the initial convolutional part; RoI pooling and everything after it are not shared, so it can be regarded as partial sharing. This has two costs: 1. Information loss and reduced accuracy. 2. Because the later parts of the network are not shared, layers such as the fully connected layers have to be recomputed for every RoI, which is expensive. (It should be added that a fully connected layer requires more computation than a convolutional layer.)
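To make the second cost concrete, here is a rough count of multiply-accumulate operations (MACs) for a head that is not shared between RoIs. This is only a sketch: the layer sizes (a 7x7x512 pooled feature followed by two 4096-unit fully connected layers, as in a VGG16-style head) and the RoI counts are assumptions chosen for illustration, and the function names are hypothetical.

```python
# Hypothetical per-RoI head: two fully connected layers with VGG16-like sizes.
# The sizes and the RoI counts are assumptions for illustration only.

def fc_macs(n_in, n_out):
    """Multiply-accumulate operations of one fully connected layer."""
    return n_in * n_out


def head_macs_per_roi(pooled_h=7, pooled_w=7, channels=512, hidden=4096):
    """MACs of the two-layer FC head applied to one RoI-pooled feature."""
    flat = pooled_h * pooled_w * channels  # flattened RoI-pooled feature
    return fc_macs(flat, hidden) + fc_macs(hidden, hidden)


def total_head_macs(num_rois, **kwargs):
    """The head is not shared, so its cost is paid once per RoI."""
    return num_rois * head_macs_per_roi(**kwargs)


if __name__ == "__main__":
    for n in (1, 100, 300):
        print(n, "RoIs ->", total_head_macs(n), "MACs in the per-RoI head")
```

The cost grows linearly with the number of RoIs, which is the part R-FCN tries to remove by moving the work into layers shared by the whole image.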
Therefore, R-FCN (Region-based Fully Convolutional Network) tries to improve on Faster R-CNN by drawing on FCN.
2.1 Problems
The first problem is how to fix the incomplete sharing.
FCN (fully convolutional network) addresses the incomplete sharing: it replaces the fully connected layers used for classification in common backbone networks with convolutional layers, so that the whole network consists of convolutional layers, hence the name fully convolutional network.
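The replacement of a fully connected layer by a convolutional layer can be checked numerically. The sketch below, with made-up sizes and random weights (all names here are illustrative, not taken from any paper or library), reshapes the weights of a fully connected classifier into convolution kernels that span the whole feature map and confirms that both give the same scores. On a larger input the convolutional form simply yields a spatial grid of scores.

```python
import numpy as np

rng = np.random.default_rng(0)

C, H, W = 8, 4, 4   # feature map: channels, height, width (hypothetical sizes)
num_classes = 5     # hypothetical number of output classes

feat = rng.standard_normal((C, H, W))
fc_weight = rng.standard_normal((num_classes, C * H * W))
fc_bias = rng.standard_normal(num_classes)


def fc_forward(x, weight, bias):
    """Fully connected layer applied to the flattened feature map."""
    return weight @ x.reshape(-1) + bias


def conv_valid(x, kernels, bias):
    """Plain 'valid' cross-correlation; x is (C, H, W), kernels is (K, C, kh, kw)."""
    K, _, kh, kw = kernels.shape
    out_h = x.shape[1] - kh + 1
    out_w = x.shape[2] - kw + 1
    out = np.empty((K, out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[:, i:i + kh, j:j + kw]
            out[:, i, j] = np.tensordot(
                kernels, patch, axes=([1, 2, 3], [0, 1, 2])
            ) + bias
    return out


# The same weights, viewed as num_classes kernels of size C x H x W.
conv_kernels = fc_weight.reshape(num_classes, C, H, W)

fc_out = fc_forward(feat, fc_weight, fc_bias)
conv_out = conv_valid(feat, conv_kernels, fc_bias)   # shape (num_classes, 1, 1)
print(np.allclose(fc_out, conv_out[:, 0, 0]))

# A larger input is no problem for the convolutional form.
feat_big = rng.standard_normal((C, 6, 7))
big_out = conv_valid(feat_big, conv_kernels, fc_bias)  # shape (num_classes, 3, 4)
print(big_out.shape)
```

The fully connected form only accepts one input size, while the convolutional form slides over any input that is at least as large as the kernel.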
The second problem is what object detection itself requires.
Obviously, object detection contains two sub-problems: the first is to determine the object's class, and the second is to determine the object's position. When determining the class, we want the network to be position insensitive (that is, it should classify the object correctly no matter where it appears); when determining the position, we want it to be position sensitive (that is, it should report where the object is, however the position changes).
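The two requirements can be illustrated with a toy example. This is only a sketch with a made-up 8x8 "image" and hypothetical helper names: global max pooling stands in for a position-insensitive class score, and the arg-max location stands in for a position-sensitive output.

```python
import numpy as np


def make_image(top, left, size=8):
    """Blank image with a 2x2 bright 'object' whose top-left corner is (top, left)."""
    img = np.zeros((size, size))
    img[top:top + 2, left:left + 2] = 1.0
    return img


def class_score(img):
    """Global max pooling discards location: a position-insensitive cue."""
    return img.max()


def object_position(img):
    """Arg-max keeps location: a position-sensitive cue (first maximum, row-major)."""
    return np.unravel_index(np.argmax(img), img.shape)


a = make_image(1, 1)
b = make_image(4, 5)

# Same score for both images, but different reported positions.
print(class_score(a), class_score(b))
print(object_position(a), object_position(b))
```

Moving the object leaves the class score unchanged while the reported position follows it; a detector needs both behaviours at once.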
These two requirements seem contradictory, and R-FCN appears to compromise between them, although it is not really a compromise. The issue is this: a fully convolutional network is very strong at extracting features, so it classifies objects well, but an ordinary convolutional network attends only to features and not to location information, so it cannot be used directly for detection. R-FCN therefore introduces "position-sensitive score maps" into the FCN to make the fully convolutional network sensitive to object position.
Let's look at the structure first, and then explain position sensitivity in the context of that structure.
2.2 Structure and process
The following figure shows the structure of R-FCN. The object detection process is as follows:
The original image is convolved to obtain feature map 1. One subnetwork works as in Faster R-CNN: an RPN slides over feature map 1 to generate candidate regions for later use. The other subnetwork continues convolving and produces feature map 2, the position-sensitive score maps, with depth k^2(C+1) (k = 3, so there are k^2 = 9 maps for each of the C+1 categories). For each RoI (region of interest, i.e. a region proposal) generated by the RPN, feature map 2 is pooled, scored and classified to give the final detection results.
Figure 3 below shows a successful position-sensitive detection. The nine feature maps in the middle of Figure 3 are the nine position-sensitive score maps from the left side of the structure diagram, and each one corresponds to one part of the object of interest. For example, one of the maps responds to the head of a person. The response of each map is pooled into its own matching position in the output on the right side of Figure 3 (one such output for each of the C+1 categories): the top-center map fills the top-center bin, the bottom-left map fills the bottom-left bin, and so on. In this way the position sensitivity is preserved.
When the scores of all nine bins of the pooled map exceed a certain threshold, we can believe that this region proposal contains an object.
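The pooling and scoring described above can be sketched in NumPy. This is a simplified illustration and not the paper's implementation: the bin-major channel layout, the integer bin boundaries, the synthetic score maps and the 0.5 threshold are all assumptions, and the function names are hypothetical. Bin (i, j) of the output reads only from the score map assigned to bin (i, j), and the k x k bin scores are then averaged into one score per category.

```python
import numpy as np


def ps_roi_pool(score_maps, roi, k, num_cat):
    """Simplified position-sensitive RoI pooling.

    score_maps: array of shape (k*k*num_cat, H, W). Assumed layout is
                bin-major: channel = (i*k + j) * num_cat + c.
    roi:        (x0, y0, x1, y1) in feature-map pixels, x1/y1 exclusive,
                assumed to lie inside the map.
    Returns an array of shape (num_cat, k, k).
    """
    x0, y0, x1, y1 = roi
    ys = np.linspace(y0, y1, k + 1)   # bin edges along height
    xs = np.linspace(x0, x1, k + 1)   # bin edges along width
    pooled = np.zeros((num_cat, k, k))
    for i in range(k):
        for j in range(k):
            ya = int(np.floor(ys[i]))
            yb = max(int(np.ceil(ys[i + 1])), ya + 1)
            xa = int(np.floor(xs[j]))
            xb = max(int(np.ceil(xs[j + 1])), xa + 1)
            group = (i * k + j) * num_cat   # maps dedicated to bin (i, j)
            for c in range(num_cat):
                pooled[c, i, j] = score_maps[group + c, ya:yb, xa:xb].mean()
    return pooled


def vote(pooled):
    """Average the k*k bin scores per category, then apply a softmax."""
    scores = pooled.mean(axis=(1, 2))
    exp = np.exp(scores - scores.max())
    return scores, exp / exp.sum()


# Synthetic example: k = 3, two categories (0 = background, 1 = object).
k, num_cat, size = 3, 2, 12
score_maps = np.zeros((k * k * num_cat, size, size))
for i in range(k):
    for j in range(k):
        # The map for bin (i, j) fires only where that part of the object
        # lies, for an object occupying rows 0..8 and columns 0..8.
        ch = (i * k + j) * num_cat + 1
        score_maps[ch, 3 * i:3 * i + 3, 3 * j:3 * j + 3] = 1.0

aligned = ps_roi_pool(score_maps, (0, 0, 9, 9), k, num_cat)    # RoI on the object
shifted = ps_roi_pool(score_maps, (3, 3, 12, 12), k, num_cat)  # RoI off by one bin

aligned_scores, aligned_probs = vote(aligned)
shifted_scores, shifted_probs = vote(shifted)

threshold = 0.5  # hypothetical threshold
print(bool((aligned[1] > threshold).all()), bool((shifted[1] > threshold).all()))
```

With the aligned RoI every bin of the object category responds, so all nine bins pass the threshold; with the shifted RoI each bin looks at the wrong part of its map and the score collapses, even though the same features are still present in the image.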
Figure 4 below shows a failed detection: the pooled-map score in the red box is too low.
3. Summary
The above are my reading notes on R-FCN. Its contributions are: 1. It introduces FCN so that more of the network's parameters and computation are shared (compared with Faster R-CNN). 2. It solves the insufficient position sensitivity of fully convolutional networks (using position-sensitive score maps).
Otherwise the structure differs little from Faster R-CNN (the RPN is kept, and it shares the feature extraction of the initial conv subnetwork).
I read this paper without a deep understanding of FCN, so I will read FCN and Mask R-CNN next; after that, the two-stage detection methods can be wrapped up for now.