Dense Prediction with Attentive Feature Aggregation in Image Segmentation
Original document: /lart/papers/xnqoi0

This paper, which I came across by chance on arXiv, can be regarded as an extension of the earlier work Hierarchical Multi-Scale Attention for semantic segmentation.

Aggregating information from features at different layers is a basic operation of dense prediction models.

Despite its limited expressive power, simple feature concatenation dominates the choice of aggregation operations.

In this paper, we introduce Attentive Feature Aggregation (AFA) to fuse different network layers with more expressive nonlinear operations. AFA exploits both spatial and channel attention to compute weighted averages of the layer activations.

Inspired by neural volume rendering, we extend AFA with Scale-Space Rendering (SSR) to perform late fusion of multi-scale predictions.

AFA is suitable for a wide range of existing network designs.

Our experiments show consistent and significant improvements on challenging semantic segmentation benchmarks, including Cityscapes, BDD100K, and Mapillary Vistas, with negligible computation and parameter overhead. In particular, AFA improves the performance of the Deep Layer Aggregation (DLA) model on Cityscapes by nearly 6%. Our experimental analysis further shows that AFA learns to progressively refine the segmentation map and to improve boundary details, leading to state-of-the-art results on the BSDS500 and NYUDv2 boundary detection benchmarks.

Two fusion forms are designed here: one for dual-input fusion and one for progressive multi-input fusion. The core of both is spatial attention and channel attention. Note that the computation is pairwise, so after an attention map is computed, a sigmoid is applied to turn it into relative weights between the two inputs being fused.

For the dual-input form, spatial attention is computed from the shallow feature, because it contains rich spatial information, while channel attention is computed from the deep feature, because it carries richer channel (semantic) information. For the multi-input form (only three layers are shown in the figure, but more input layers can be introduced), both the channel and spatial attention are computed from the current layer's input, and the attention computed at a given layer is used to weight that layer's input against the previously aggregated output. In addition, the aggregation order is described in the original text as "features with higher priority will undergo a higher number of aggregations"; my understanding is that this is a deep-to-shallow process. A rough sketch of the dual-input form is given below.
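To make the dual-input form concrete, here is a minimal PyTorch sketch of this kind of attentive binary fusion. The module name, channel sizes and exact convolution layout are my own assumptions for illustration rather than the paper's implementation; the point is only that spatial attention comes from the shallow feature, channel attention comes from the deep feature, and a sigmoid turns them into relative weights between the two inputs:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DualInputAFA(nn.Module):
        """Sketch of a dual-input attentive fusion block (assumed layout)."""

        def __init__(self, channels: int):
            super().__init__()
            # Spatial attention: a 1-channel map predicted from the shallow feature.
            self.spatial = nn.Conv2d(channels, 1, kernel_size=3, padding=1)
            # Channel attention: per-channel weights from the globally pooled deep feature.
            self.channel = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(channels, channels, kernel_size=1),
            )

        def forward(self, shallow, deep):
            # Bring the deep feature up to the shallow feature's resolution.
            deep = F.interpolate(deep, size=shallow.shape[-2:],
                                 mode="bilinear", align_corners=False)
            # Combine spatial and channel attention into one relative weight in (0, 1).
            attn = torch.sigmoid(self.spatial(shallow) + self.channel(deep))
            # Weighted average of the two layer activations.
            return attn * deep + (1.0 - attn) * shallow

    # Usage: fuse a shallow (high-resolution) and a deep (low-resolution) feature map.
    shallow = torch.randn(1, 64, 128, 128)
    deep = torch.randn(1, 64, 64, 64)
    print(DualInputAFA(64)(shallow, deep).shape)  # torch.Size([1, 64, 128, 128])

The multi-input progressive form would simply chain such pairwise fusions, with each layer's attention weighting its own input against the previously aggregated output.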

The proposed fusion module can be used in many architectures, such as DLA, U-Net, HRNet and FCN.

The SSR proposed here is a strategy closer to model ensembling.

It ensembles multi-scale inference by computing relative weights for the prediction outputs at different scales. Two issues are therefore involved here: how the per-scale weight (attention) is defined, and how the weighted predictions are fused.

To formalize the fusion of multi-scale predictions, the author first focuses on a single pixel and assumes that the model produces a prediction for this target pixel at each of several scales.

The prediction at the $i$-th scale can be written as $y_i$, so the representation of the target pixel in scale space can be defined as $y = \{y_i\}_{i=1}^{k}$. In addition, the scales are assumed to be ordered, with one scale's representation coarser than the next.

The target pixel can then be imagined as a ray of light travelling through scale space, moving from one scale to the next.

Based on this idea, the hierarchical attention of the earlier multi-scale fusion mechanism is redesigned to mimic the volume rendering equation, with the volume implicitly given by the scale space.

Therefore, in addition to the feature representation at each scale, the model is assumed to also predict a scalar $\sigma_i$ for the target pixel. In the language of volume rendering, the probability of the particle passing through scale $i$ can then be expressed as $e^{-\phi(\sigma_i)}$, where $\phi$ is a non-negative activation (discussed further below).

Scale attention can therefore be expressed as the probability that the particle reaches scale $i$ and stays there (at each scale a Bernoulli trial decides whether it stays or passes through; having passed through all earlier scales, it stops at the current one):

$w_i = \big(1 - e^{-\phi(\sigma_i)}\big)\,\prod_{j<i} e^{-\phi(\sigma_j)}$

Here $\sigma_i$ is the scalar parameter that the model predicts for the target pixel at each scale.

Finally, following the volume rendering equation, the fused multi-scale prediction for the target pixel is obtained as an attention-weighted sum over the scales,

$\hat{y} = \sum_{i=1}^{k} w_i\, y_i,$

which also reflects that the final representation of the target pixel is driven by a fusion of the feature representations of all scales.

Judging from the overall context, the design should ultimately aggregate all of the scales onto scale 1.
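A minimal sketch of how I read this scale-space rendering fusion (not the paper's code): each scale contributes logits $y_i$ plus a scalar map $\sigma_i$, the per-scale attention is the probability of passing through all earlier scales and stopping at the current one, and the fused prediction is the attention-weighted sum. Following the later discussion, $\phi$ is taken to be the absolute value:

    import torch

    def ssr_fuse(logits, sigmas, phi=torch.abs):
        """Scale-space-rendering-style fusion of multi-scale predictions (sketch).

        logits: list of k tensors [B, C, H, W], already resized to a common scale.
        sigmas: list of k tensors [B, 1, H, W], the per-scale scalar predictions.
        phi:    non-negative activation applied to sigma (absolute value here).
        """
        fused = torch.zeros_like(logits[0])
        transmittance = torch.ones_like(sigmas[0])   # prob. of having passed all earlier scales
        for y_i, sigma_i in zip(logits, sigmas):
            pass_prob = torch.exp(-phi(sigma_i))         # prob. of passing through scale i
            weight = transmittance * (1.0 - pass_prob)   # reach scale i and stop there
            fused = fused + weight * y_i
            transmittance = transmittance * pass_prob
        return fused

    # Usage with three scales, all already resized to the 1.0x resolution.
    logits = [torch.randn(1, 19, 64, 64) for _ in range(3)]
    sigmas = [torch.randn(1, 1, 64, 64) for _ in range(3)]
    print(ssr_fuse(logits, sigmas).shape)  # torch.Size([1, 19, 64, 64])

Note that in this sketch the weights do not sum exactly to one; the transmittance left over after the last scale could be folded into the final scale's weight, which matches the "product of complements" form discussed below.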

The proposed SSR can be regarded as a generalized form of hierarchical multiscale attention (HMA).

The latter form can be recovered by restricting to two scales, choosing $\phi$ so that the per-scale stop probability $1 - e^{-\phi(\sigma)}$ becomes a sigmoid, and fixing the second scale so that it absorbs all of the remaining weight. In that case:

$w_1 = \mathrm{sigmoid}(\sigma_1), \qquad w_2 = 1 - \mathrm{sigmoid}(\sigma_1)$
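As a sanity check of this reduction (my own derivation, not quoted from the paper): choosing $\phi$ to be the softplus makes the per-scale stop probability exactly a sigmoid,

    \phi(\sigma) = \log\big(1 + e^{\sigma}\big)
    \;\Longrightarrow\;
    1 - e^{-\phi(\sigma)} \;=\; 1 - \frac{1}{1 + e^{\sigma}} \;=\; \frac{e^{\sigma}}{1 + e^{\sigma}} \;=\; \operatorname{sigmoid}(\sigma),

so with two scales, and the second scale fixed to take all of the remaining weight, one recovers $w_1 = \operatorname{sigmoid}(\sigma_1)$ and $w_2 = 1 - \operatorname{sigmoid}(\sigma_1)$, i.e. exactly the HMA weighting.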

Judging from this formulation, there are two puzzling points:

The input is rescaled again before being fed to the model, and the final output size corresponds to 1.0x the original input size. I therefore assume that the features are aggregated in order of scale index from $k$ down to 1, with the result produced at scale 1.

Because the attention constructed in this paper is based on the probability of not stopping at (i.e. passing through) the current layer, the general form can be written roughly as

$w_1 = s_1, \qquad w_i = s_i \prod_{j<i} (1 - s_j) \ \ (1 < i < k), \qquad w_k = \prod_{j<k} (1 - s_j),$

where $s_j$ denotes the sigmoid-like "stay" probability of scale $j$. In other words, the attention weight of the first layer is the direct sigmoid output, while the weight of layer $k$ is obtained by taking the complements of the sigmoid outputs of the preceding layers and multiplying them together.
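In code, this "direct sigmoid for the first scale, product of complements for the last" pattern is a stick-breaking scheme over the scales. A small sketch of my reading of it (with hypothetical scalar inputs), where the weights sum to one by construction:

    import torch

    def stick_breaking_weights(sigmas):
        """Per-scale attention built from sigmoid 'stay' probabilities (sketch).

        Scale 1 gets sigmoid(sigma_1) directly, an intermediate scale i gets
        sigmoid(sigma_i) * prod_{j<i} (1 - sigmoid(sigma_j)), and the last scale
        takes the remaining product of complements, so the weights sum to 1.
        """
        weights = []
        remaining = torch.ones_like(sigmas[0])
        for sigma_i in sigmas[:-1]:
            stay = torch.sigmoid(sigma_i)
            weights.append(remaining * stay)
            remaining = remaining * (1.0 - stay)
        weights.append(remaining)  # the last scale absorbs whatever is left
        return weights

    w = stick_breaking_weights([torch.tensor(0.3), torch.tensor(-1.2), torch.tensor(2.0)])
    print([round(x.item(), 3) for x in w], sum(w).item())  # the weights sum to 1.0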

The absolute value function, $\phi(x) = |x|$, is used in the experiments. This choice is motivated by an analysis of how well gradients flow through the attention mechanism, since the authors found that existing attention mechanisms can suffer from vanishing gradients.

Note the coefficient form organized above:

$w_i = \big(1 - e^{-\phi(\sigma_i)}\big)\,\prod_{j<i} e^{-\phi(\sigma_j)}.$

Consider the derivative of the attention coefficient of scale $i$ with respect to the learnable parameter $\sigma_j$. When two scales are considered, the Jacobian is

$\begin{pmatrix} \partial w_1/\partial \sigma_1 & \partial w_1/\partial \sigma_2 \\ \partial w_2/\partial \sigma_1 & \partial w_2/\partial \sigma_2 \end{pmatrix} = \begin{pmatrix} \phi'(\sigma_1)\,e^{-\phi(\sigma_1)} & 0 \\ -\phi'(\sigma_1)\,e^{-\phi(\sigma_1)}\big(1 - e^{-\phi(\sigma_2)}\big) & \phi'(\sigma_2)\,e^{-\phi(\sigma_1)}\,e^{-\phi(\sigma_2)} \end{pmatrix}.$

The upper-left entry is the derivative of the scale-1 attention coefficient with respect to the scale-1 parameter, and the upper-right entry is the derivative of the scale-1 coefficient with respect to the scale-2 parameter. Since the first row depends only on $\sigma_1$, no matter what $\sigma_2$ is, the gradient of $w_1$ vanishes whenever $\phi'(\sigma_1)\,e^{-\phi(\sigma_1)}$ goes to zero.

Therefore, to avoid the vanishing-gradient problem, $\phi$ still needs to be chosen carefully. When the absolute value function is selected, $\phi'(\sigma) = \pm 1$, and the Jacobian above does not vanish as long as $\sigma_1$ and $\sigma_2$ are non-zero and finite.
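A quick autograd check of this point, under my reading of the two-scale weights $w_1 = 1 - e^{-\phi(\sigma_1)}$ and $w_2 = e^{-\phi(\sigma_1)}\,(1 - e^{-\phi(\sigma_2)})$; the comparison against a ReLU-style $\phi$ is my own illustration of a choice that kills gradients, not taken from the paper:

    import torch

    def two_scale_weights(sigma, phi):
        """w_1, w_2 for a two-scale SSR-style weighting (sketch)."""
        u = torch.exp(-phi(sigma))                    # pass-through probability per scale
        return torch.stack([1 - u[0], u[0] * (1 - u[1])])

    sigma = torch.tensor([-0.5, 0.5])
    # Jacobian entries d w_i / d sigma_j for two choices of phi.
    jac_abs = torch.autograd.functional.jacobian(lambda s: two_scale_weights(s, torch.abs), sigma)
    jac_relu = torch.autograd.functional.jacobian(lambda s: two_scale_weights(s, torch.relu), sigma)
    print(jac_abs)   # non-zero gradient flows to both sigma_1 and sigma_2
    print(jac_relu)  # no gradient reaches sigma_1, because relu'(-0.5) = 0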

Considering the HMA case, according to the form given by the authors, we have

$w_1 = \mathrm{sigmoid}(\sigma_1), \qquad w_2 = 1 - \mathrm{sigmoid}(\sigma_1),$

so the Jacobian becomes

$\begin{pmatrix} s_1(1 - s_1) & 0 \\ -s_1(1 - s_1) & 0 \end{pmatrix}, \qquad s_1 = \mathrm{sigmoid}(\sigma_1).$

Scale 2 does not participate in the attention computation at all (its column is zero), and when $\sigma_1$ saturates, i.e. $|\sigma_1|$ becomes large, $s_1(1 - s_1) \to 0$ and the gradient vanishes.

According to the general form I wrote above, with two scales one likewise gets $w_1 = s_1$ and $w_2 = 1 - s_1$, so the derivatives again contain the factor $s_1(1 - s_1)$.

The same vanishing problem therefore appears once the sigmoid saturates.
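The saturation can also be checked numerically; a tiny demo of my own for the sigmoid-based two-scale form, where $\partial w_1 / \partial \sigma_1 = \mathrm{sigmoid}(\sigma_1)\,(1 - \mathrm{sigmoid}(\sigma_1))$:

    import torch

    # The gradient of w_1 = sigmoid(sigma_1) shrinks towards zero as |sigma_1| grows.
    for s1 in [0.0, 2.0, 6.0, 12.0]:
        sigma_1 = torch.tensor(s1, requires_grad=True)
        torch.sigmoid(sigma_1).backward()
        print(f"sigma_1={s1:5.1f}  dw1/dsigma_1={sigma_1.grad.item():.2e}")
    # Roughly 2.5e-01, 1.0e-01, 2.5e-03, 6.1e-06: the vanishing-gradient problem.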