BlendMask is a one-stage dense instance segmentation method that combines the ideas of top-down and bottom-up approaches. Built on the anchor-free detector FCOS, it adds a bottom module to extract low-level detail features and a top layer to predict instance-level attentions. For fusing the two, the authors draw on the fusion strategies of FCIS and YOLACT and propose a Blender module that better integrates these two kinds of features. In the end, BlendMask beats Mask R-CNN on COCO in both accuracy (41.3 AP) and speed (25 fps on a 1080 Ti).
Although the paper achieves high accuracy and speed, its novelty is not outstanding. Fortunately, the experiments are thorough, and its ideas for optimizing the model are worth learning from. In the end it is compared against Mask R-CNN and comes out looking good ~
Background
This paper focuses on dense instance segmentation, which likewise has top-down and bottom-up approaches.
Top-down methods
DeepMask is the progenitor of top-down dense instance segmentation: it predicts a mask proposal in each spatial region with a sliding window. This approach has the following three disadvantages:
The connection between masks and features (local consistency) is lost; for example, DeepMask uses a fully connected network to extract masks.
The feature representation is redundant; for example, DeepMask extracts a mask once for every foreground feature.
Position information is lost through downsampling (convolutions with stride greater than 1).
Bottom-up methods
The general routine of bottom-up dense instance segmentation is to generate per-pixel embedding features and then group them with post-processing such as clustering or graph-based methods. Although this preserves good low-level features (detail and position information), it also has the following disadvantages:
It relies heavily on the quality of the dense per-pixel predictions, which can lead to suboptimal segmentations.
Generalization is poor: it cannot cope with complex scenes containing many categories.
The post-processing method is cumbersome.
Hybrid method
This paper seeks to combine the two ideas: use the instance-level high-level information (such as bboxes) produced by a top-down method to fuse the per-pixel predictions produced by a bottom-up method. It therefore proposes BlendMask, a compact network built on FCOS. Drawing on the ideas of FCIS (cropping) and YOLACT (weighted summation), a Blender module is proposed that better integrates instance-level global information with low-level features carrying detail and position information.
Overall idea
The overall architecture of BlendMask is shown in the figure below; it consists of a detector module and the BlendMask module. The detector module directly uses FCOS. The BlendMask module is composed of three parts: a bottom module that processes low-level features and produces score maps called bases; a top layer attached to the detector's box head that predicts a top-level attention corresponding to each base; and a blender that fuses the bases with the attentions.
The overall BlendMask framework
Bottom module
This part is structured similarly to FCIS and YOLACT. The input is low-level features output by the backbone or FPN; through a series of decoding steps (upsampling and convolution), it generates score maps called bases (B). This paper uses the DeepLabV3+ decoder, but other decoder sub-networks are also applicable.
The bases have dimension N×K×(H/s)×(W/s), where N is the batch size, K is the number of bases, H and W are the input image sizes, and s is the output stride of the bases.
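Below is a minimal PyTorch sketch of such a bottom module; the layer widths and the simple fuse-and-refine decoding are our own simplification for illustration, not the paper's exact DeepLabV3+ decoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BottomModule(nn.Module):
    """Decode low-level features (e.g. FPN P3/P5) into K score maps (bases)."""
    def __init__(self, in_channels=256, num_bases=4):
        super().__init__()
        self.refine = nn.Conv2d(in_channels, 128, kernel_size=3, padding=1)
        self.predict = nn.Conv2d(128, num_bases, kernel_size=1)

    def forward(self, p3, p5):
        # Upsample the deeper feature map to P3's resolution and fuse the two levels.
        p5_up = F.interpolate(p5, size=p3.shape[-2:], mode="bilinear", align_corners=False)
        x = F.relu(self.refine(p3 + p5_up))
        return self.predict(x)  # bases B: (N, K, H/s, W/s)

bases = BottomModule()(torch.randn(1, 256, 100, 152), torch.randn(1, 256, 25, 38))
print(bases.shape)  # torch.Size([1, 4, 100, 152])
```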
DeepLabV3+ network framework
Top layer
A convolution layer is appended to each pyramid level of the detector to predict the top-level attentions (A). This is similar to YOLACT, but different:
In YOLACT, the output has dimension N×K, i.e. one scalar weight per base for each of the N instances, called mask coefficients in the original paper.
In this paper, the output dimension per instance is K×M×M, where M×M is the attention resolution: every pixel of the corresponding base gets its own weight, so the attention carries much finer granularity.
Since the top-level attention is three-dimensional, it can encode some instance-level information such as rough shape and pose. Concretely, it is implemented as a convolution with K·M·M output channels.
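For concreteness, here is a one-layer PyTorch sketch of such a top layer; the 256-channel tower width is an assumption based on typical FPN heads.

```python
import torch
import torch.nn as nn

K, M = 4, 7  # number of bases and attention resolution (the paper's final settings)
top_layer = nn.Conv2d(256, K * M * M, kernel_size=3, padding=1)

# For a tower feature of shape (N, 256, H_l, W_l), the output has shape
# (N, K*M*M, H_l, W_l); per detected box it is later reshaped to (K, M, M).
feat = torch.randn(1, 256, 25, 38)
print(top_layer(feat).shape)  # torch.Size([1, 196, 25, 38])
```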
Blender
The Blender module is the innovative part of the paper. Its blending process is explained as follows:
First, define the inputs of the Blender module:
Bounding box proposals (P) produced by the detector tower; during training the ground-truth boxes are used directly as P, while at inference the detector's predictions are used.
The top-level attentions (A) produced by the top layer, with per-proposal dimension (K×M×M).
The bases (B) produced by the bottom module: K masks of full image size, with dimension (K×H×W).
For B: using the RoIPooler from Mask R-CNN (i.e. RoIAlign, here with sampling ratio 1 rather than the 4 samples used in Mask R-CNN), crop out the region of B corresponding to each proposal in P and resize it to a fixed R×R feature map r, with final dimension (K×R×R). A sketch of this step is shown below.
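A minimal sketch of this cropping step with torchvision's roi_align; the shapes and boxes are made up for illustration, and in practice spatial_scale would be 1/s if the boxes are given in image coordinates.

```python
import torch
from torchvision.ops import roi_align

N, K, H, W, R = 2, 4, 200, 304, 56
bases = torch.randn(N, K, H, W)  # B from the bottom module
# One list entry per image, each of shape (num_boxes_i, 4) in (x1, y1, x2, y2)
boxes = [torch.tensor([[10., 20., 120., 180.]]),
         torch.tensor([[0., 0., 50., 60.]])]
r = roi_align(bases, boxes, output_size=(R, R), spatial_scale=1.0, sampling_ratio=1)
print(r.shape)  # (total_boxes, K, R, R) -> torch.Size([2, 4, 56, 56])
```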
For A: this step is really post-processing of the top layer, so it is described here. Following FCOS's post-processing, the author selects the top D detection boxes and their corresponding attentions, and uses RoIAlign (sampling ratio = 1) plus a reshape to convert A from (K·M·M, H′, W′) to per-box attentions of dimension (K×M×M), denoted a.
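One way to realize this crop-and-reshape (our assumption for illustration, not necessarily how the official code does it) is a 1×1 RoIAlign followed by a reshape:

```python
import torch
from torchvision.ops import roi_align

K, M = 4, 7
attn_map = torch.randn(1, K * M * M, 25, 38)  # (N, K*M*M, H', W') from the top layer
boxes = [torch.tensor([[3., 4., 15., 20.]])]  # top-D boxes, in feature-map coordinates
a = roi_align(attn_map, boxes, output_size=(1, 1), sampling_ratio=1)
a = a.reshape(-1, K, M, M)                    # per-box attentions: (D, K, M, M)
print(a.shape)                                # torch.Size([1, 4, 7, 7])
```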
For a: since M is generally smaller than R, interpolate a from M×M up to R×R, obtaining a′ with dimension (K×R×R).
Then apply a softmax along the K dimension to obtain a set of score maps s, also of dimension (K×R×R).
Fusion: at this point r and s both have size (K×R×R), so an element-wise product can be taken directly: each of the K box-sized masks is multiplied by its corresponding attention, and the results are summed over the channels to produce the final mask. A sketch of the whole blend step follows.
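Putting the last three steps together, a minimal sketch of the blend under the shape conventions above (interpolate a to R×R, softmax over K, element-wise product, sum over channels):

```python
import torch
import torch.nn.functional as F

def blend(r, a, R=56):
    # r: cropped bases, (D, K, R, R); a: per-box attentions, (D, K, M, M)
    a_up = F.interpolate(a, size=(R, R), mode="bilinear", align_corners=False)
    s = torch.softmax(a_up, dim=1)  # score maps: normalize across the K bases
    return (r * s).sum(dim=1)       # final masks: (D, R, R)

masks = blend(torch.randn(2, 4, 56, 56), torch.randn(2, 4, 7, 7))
print(masks.shape)                  # torch.Size([2, 56, 56])
```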
The blending process
Experiments
Hyperparameter settings
BlendMask's hyperparameters are as follows:
R, the bottom-level RoI resolution, set to 56 in this paper.
M, the resolution of the top-level prediction (A), generally much smaller than R; set to 7 in this paper.
K, the number of bases, set to 4 in this paper.
The input features of the bottom module, taken from the backbone (C3, C5) or the FPN (P3, P5); this paper adopts P3 and P5.
The sampling method for the bottom bases, either nearest-neighbor or bilinear pooling; this paper adopts bilinear pooling.
The interpolation method for the top-level attentions, either nearest-neighbor or bilinear; this paper adopts bilinear interpolation.
These hyperparameters are ablated later. For a fair comparison with other models, the BlendMask settings used in the ablation experiments are: R, K, M of 28, 4, 4 respectively; the bottom module takes its input features from backbone stages C3 and C5; the top-level attentions use nearest-neighbor interpolation, consistent with FCIS; and the bottom uses bilinear pooling, consistent with RoIAlign. Both configurations are summarized in the sketch below.
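For reference, the two configurations can be written out as plain dicts (the field names are ours, not the official codebase's):

```python
# Final model settings reported in the paper.
final_cfg = dict(R=56, M=7, K=4,
                 bottom_inputs=("P3", "P5"),
                 bottom_sampling="bilinear",
                 top_interpolation="bilinear")

# Baseline settings used in the ablation experiments.
ablation_cfg = dict(R=28, M=4, K=4,
                    bottom_inputs=("C3", "C5"),
                    bottom_sampling="bilinear",
                    top_interpolation="nearest")
```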
Overall results
Looking at the overall results first: on the COCO dataset, BlendMask's accuracy and speed surpass the other single-stage instance segmentation models, and essentially surpass Mask R-CNN as well (except for the R-50 model without augmentation).
BlendMask-RT
The authors also built a fast version, BlendMask-RT, for speed comparison. The changes in this real-time version are as follows:
The number of convolution layers in the prediction head is reduced to 3.
YOLACT's Proto-FPN is used as the bottom module, and the box tower and classification tower are merged into one.