Paper Reading - D2-Net: A Trainable CNN for Joint Description and Detection of Local Features
As another representative of jointly learned (also called one-stage) detectors and descriptors from the past two years, D2-Net has a rather unusual structure. Its hallmark is that "one map serves two purposes": the dense tensor predicted by the network is simultaneously a detection score map and a descriptor feature map, representing both the feature detection result and the feature description result (note that the predicted feature map is not at the original image resolution). In other words, the detection and description modules of D2-Net are tightly coupled.

This paper targets image matching in scenes with large appearance changes (day-night changes, large viewpoint changes, etc.). The authors compare two families of local feature learning methods: sparse and dense. Sparse methods are efficient, but they fail to extract repeatable keypoints under large appearance changes, because the detector relies only on shallow image information and ignores semantic information. Dense methods extract dense descriptors directly from deep features, which is more robust, at the cost of higher matching time and memory overhead.

The authors therefore aim to propose a sparse local feature that is robust enough for the extracted interest points to be more repeatable, combining the efficiency of sparse methods with the robustness of dense methods. The core idea is to postpone the detection stage, so that local features can also exploit high-level semantic information instead of considering only low-level cues.

Question: About the sparse and dense methods here.

Keywords: a single CNN playing a dual role; joint optimization; different training/testing model structures

Unlike SuperPoint or SEKD, although this paper also uses a dense prediction structure, it does not predict keypoints and descriptions with separate heads. It predicts only a feature map of shape H×W×d (d is the descriptor length), from which both the description result and the detection result are derived. Along the spatial dimension, each pixel position of the feature map is a descriptor; along the channel dimension, each channel is the response of one feature detector, giving d 2D response maps, comparable to the difference-of-Gaussian pyramid responses in SIFT.

Extracting interest points then requires further post-processing of this d-channel feature map:

According to the definition of the D2 feature map above, if position (i, j) is an interest point, then along the channel dimension the final detection result must take the channel with the largest detector response at that pixel (channel selection), and along the spatial dimension, the value at that position must be a local maximum in the selected channel's 2D map. This is the "hard feature detection" of the paper:
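A minimal NumPy sketch of this hard detection rule (the function name and the loop-based implementation are mine, not from the paper):

```python
import numpy as np

def hard_detect(F):
    """Hard feature detection on a dense feature map F of shape (h, w, d).

    A pixel (i, j) is kept as an interest point only if, for the channel c
    where its response is largest (channel selection), F[i, j, c] is also a
    local maximum in the 3x3 spatial neighbourhood of channel c.
    """
    h, w, d = F.shape
    # channel selection: index of the strongest detector at each pixel
    k = F.argmax(axis=2)                       # (h, w)
    keypoints = []
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            c = k[i, j]
            patch = F[i - 1:i + 2, j - 1:j + 2, c]
            # spatial selection: local maximum in the 3x3 window of channel c
            if F[i, j, c] >= patch.max():
                keypoints.append((i, j))
    return keypoints
```

Note that the output is a sparse list of coordinates, which is exactly why this step is non-differentiable and only usable at test time.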

First, an image pyramid is built from the input image, and a forward pass on each scale yields a D2 feature map per scale. The multiscale feature maps are then upsampled to a common resolution and fused (see the formula below), giving a fused feature map. At prediction time, feature points are extracted from the fused feature map via the post-processing above.
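Under the simplifying assumption of nearest-neighbour upsampling, the fusion step can be sketched as follows (in the paper the accumulation is done progressively across scales, so this is only an illustration):

```python
import numpy as np

def fuse_pyramid(maps):
    """Fuse per-scale feature maps (coarsest first) into the finest resolution.

    `maps` is a list of (h_s, w_s, d) arrays; each coarser map is upsampled
    (nearest neighbour here, for simplicity) to the finest resolution and
    the maps are accumulated, so detection at the finest scale also sees
    coarser-scale responses.
    """
    h, w, _ = maps[-1].shape
    fused = np.zeros_like(maps[-1])
    for F in maps:
        factor = h // F.shape[0]
        up = F.repeat(factor, axis=0).repeat(factor, axis=1)  # nearest upsample
        fused += up[:h, :w]
    return fused
```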

Thanks to these characteristics, the network structure itself is very simple: the layers of VGG16 up to and including conv4_3 are initialized from ImageNet pre-trained weights, then everything except the last layer, conv4_3, is frozen and only that layer is fine-tuned. Two aspects of the model are worth noting:

1. Using VGG16 gives much better results than ResNet.

2. The model structure during training and testing is different.

Specifically, at test time, to increase the feature resolution, pool3 is replaced by an average pool with stride 1, and the dilation rate of the following three conv layers is set to 2 to preserve the receptive field. The authors' explanation: the lower feature resolution during training saves memory, while the higher resolution at test time improves feature localization. The resolution is thus raised to 1/4 of the original image, a SIFT-like local refinement of feature positions is applied, and the features are finally upsampled by interpolation to the original resolution.
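The train/test switch can be illustrated with a toy PyTorch stand-in (random weights and illustrative kernel/padding choices, not the actual VGG16 layers from the paper):

```python
import torch
import torch.nn as nn

def make_tail(test_time: bool) -> nn.Sequential:
    """Toy stand-in for the layers around pool3/conv4 of VGG16.

    Training: pool3 halves the resolution as usual.
    Test: pool3 becomes a stride-1 (average) pool and the following conv
    uses dilation 2, so the receptive field is unchanged but the output
    stays at the higher resolution.
    """
    if test_time:
        pool3 = nn.AvgPool2d(kernel_size=3, stride=1, padding=1)
        conv4 = nn.Conv2d(8, 8, kernel_size=3, padding=2, dilation=2)
    else:
        pool3 = nn.MaxPool2d(kernel_size=2, stride=2)
        conv4 = nn.Conv2d(8, 8, kernel_size=3, padding=1)
    return nn.Sequential(pool3, conv4)

x = torch.randn(1, 8, 32, 32)              # pretend 1/4-resolution feature map
train_out = make_tail(test_time=False)(x)  # resolution halved again
test_out = make_tail(test_time=True)(x)    # resolution preserved
```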

However, the hard feature detection above cannot be used during training because it is non-differentiable. The authors therefore propose a soft version, designed to mimic the channel selection and spatial position selection (i.e., the local maximum within a channel) of the hard method:

For spatial position selection, the authors compute an α(i, j) for each pixel of the feature map, yielding an α map (of shape [h, w, d]):

Here N(i, j) denotes the 9-neighborhood centered at (i, j). So the local maximum here is actually a maximum over a 3×3 region, rather than a single maximum over the whole channel as Formula (3) might suggest.

For channel selection, a ratio-to-max is computed directly, giving a β map (shape [h, w, d]):

By the definition of a keypoint, the score map S should be the channel-wise maximum of the product of the α map and the β map, followed by a normalization. (Question: what does this normalization mean — that the pixel values of the score map sum to 1? Would a sigmoid on the score map, i.e., values distributed in [0, 1], be more reasonable?)
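Putting α, β and the final normalization together, a NumPy sketch of the soft detection score (the zero-padded borders and the function name are my assumptions):

```python
import numpy as np

def soft_detect(F):
    """Soft detection score map from a feature map F of shape (h, w, d).

    alpha: soft local-max (softmax-like ratio) over each 3x3 spatial
           neighbourhood, per channel;
    beta:  ratio-to-max across channels;
    score: channel-wise max of alpha * beta, normalised to sum to 1.
    """
    h, w, d = F.shape
    E = np.exp(F)
    # sum of exp over each 3x3 neighbourhood, via zero-padding
    P = np.pad(E, ((1, 1), (1, 1), (0, 0)))
    neigh = np.zeros_like(E)
    for di in range(3):
        for dj in range(3):
            neigh += P[di:di + h, dj:dj + w]
    alpha = E / neigh                            # spatial selection (soft)
    beta = F / F.max(axis=2, keepdims=True)      # channel selection (soft)
    s = (alpha * beta).max(axis=2)
    return s / s.sum()                           # normalisation
```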

A further question worth considering here: why does D2-Net need to extract interest points during training? (Structures such as R2D2 directly optimize a keypoint score map, and at prediction time only need the step of extracting feature points from that score map.)

A: This question rests on a misunderstanding. During training, interest points are not extracted; instead a dense score map is produced. The hard feature detection above is equivalent to an NMS step whose output is the sparse coordinates of interest points, whereas training the detection module needs an h×w score map, so a differentiable step is required to turn the h×w×d feature map into that score map.

(1) Triplet margin ranking loss (descriptors only)

Training the descriptors is fairly standard: given the correspondences of an input pair, each matching pair c is treated as a positive pair and mismatches as negative pairs, and a triplet loss is trained. The main question is how to construct the most informative negative pair for c given the current match. Here the authors use a neighborhood-based hard negative mining strategy: if the current match is point A and point B in the figure below, negatives are searched in I1 and I2 outside the neighborhoods of A and B, compared against the descriptor dB of B and dA of A respectively, and the candidate with the smallest descriptor distance (i.e., the hardest negative) forms the triplet with c.

In the following, p(c) and n(c) denote the positive and negative distances respectively, and m(c) denotes the triplet loss of the current match c.
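A sketch of m(c) with hardest-negative mining for a single correspondence (squared distances following the paper's margin formulation; the list-based negatives and names are my assumptions):

```python
import numpy as np

def triplet_margin_loss(dA, dB, negs_A, negs_B, margin=1.0):
    """Triplet margin ranking loss m(c) for one correspondence c = (A, B).

    p(c): positive descriptor distance between dA and dB.
    n(c): hardest negative distance -- the smallest distance from dA to any
          candidate negative in image 2 (negs_B), or from dB to any
          candidate negative in image 1 (negs_A).
    """
    p = np.linalg.norm(dA - dB)
    n = min(min(np.linalg.norm(dA - nB) for nB in negs_B),
            min(np.linalg.norm(dB - nA) for nA in negs_A))
    return max(0.0, margin + p**2 - n**2)
```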

(2) Triplet margin ranking loss with joint detection optimization

Because a D2 feature map encodes both the detection score map and the descriptors, the optimization must train detection and description jointly. On top of the triplet margin ranking loss, an objective that improves the repeatability of the detections is added. The concrete implementation: the triplet losses of all correspondences between the two input images are averaged, weighted by their detection scores. To minimize this loss, correspondences whose triplet loss is very low (i.e., highly discriminative — the pair's matching distance is much smaller than that of its hardest negative) are naturally given larger detection weights, while triplets with large losses receive smaller weights.
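The detection-weighted average can be sketched as follows (function and variable names are mine):

```python
import numpy as np

def joint_loss(m, s1, s2):
    """Detection-weighted average of per-correspondence triplet losses.

    m[c]        : triplet loss of correspondence c;
    s1[c], s2[c]: soft detection scores of c in the two images.
    Minimising this drives high detection scores toward correspondences
    whose descriptors are already discriminative (small m[c]).
    """
    w = s1 * s2          # joint detection score of each correspondence
    w = w / w.sum()      # normalised weights
    return float((w * m).sum())
```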

I find the notation of Formula (13) in the paper a bit confusing: it writes m(p(c), n(c)) directly; m(c) would be more concise.