This paper argues that most strong fine-grained image recognition methods aid classification by exploiting local features of the target, but they do not annotate local parts; instead they locate part positions with weakly supervised or unsupervised methods. Moreover, most methods rely on pre-trained detectors and therefore fail to capture the relationship between the target and its local features. To describe image content better, information should be considered in finer detail from pixels to objects to the whole scene: not only locating the parts and targets, but also describing their rich, complementary features along multiple dimensions, so as to obtain the complete content of the image and the target.
Considering how to describe the target from the perspective of convolutional features, a Context-aware Attentional Pooling (CAP) module is proposed, which effectively encodes the location and appearance information of local features. The module takes the output features of the convolutional backbone as input and learns the importance of different regions when adjusting features, obtaining rich appearance and spatial features of local regions for accurate classification.
The main contributions of this paper are as follows:
The overall flow of the algorithm is shown in the figure. It takes an image as input and outputs a specific sub-category, and comprises three components (three sets of parameters):
[Figure: overall flow of the algorithm]
Taking the backbone's output features as input, the CAP module jointly considers the contextual information of pixel-level features, small-region features, large-region features, and image-level features for classification.
The pixel-level context mainly learns the degree of correlation between pixels: the output at each position aggregates the features of all other pixels weighted by their correlation. This is implemented directly with self-attention, using convolutions for the feature transformations. This step operates directly on the backbone's output features and is not shown in the overall flowchart.
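The self-attention step described above can be sketched in numpy as follows. This is a minimal illustration, not the paper's implementation: the three projection matrices stand in for the 1×1 convolutions, and the scaling by the key dimension is a common stabilizing choice.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pixel_self_attention(feat, Wq, Wk, Wv):
    """feat: (H, W, C) backbone output. Wq/Wk/Wv: (C, C') linear maps
    applied per pixel (equivalent to 1x1 convolutions)."""
    H, W, C = feat.shape
    x = feat.reshape(H * W, C)                 # flatten the spatial grid
    q, k, v = x @ Wq, x @ Wk, x @ Wv           # per-pixel transforms
    attn = softmax(q @ k.T / np.sqrt(k.shape[1]), axis=-1)  # pixel-pixel correlation
    out = attn @ v                             # aggregate all pixels by correlation
    return out.reshape(H, W, -1)
```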
To learn context information more effectively, basic regions of different granularity levels are defined on the feature map, where the granularity level is determined by region size. Starting from the smallest region, a series of regions is derived by enlarging the width and height; similar region sets are generated at different locations, and their union forms the final region set. Covering regions of different sizes and aspect ratios provides comprehensive contextual information and captures subtle features at different levels of the image.
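The region-enumeration idea can be sketched as follows. This is only an illustration of "grow each region's width and height from the smallest size at every location"; the paper's exact granularity schedule (step sizes, minimum size, any cap on the number of regions) may differ.

```python
def generate_regions(H, W, min_h=1, min_w=1, step=1):
    """Enumerate rectangular regions (x, y, w, h) on an H x W feature map
    by growing width/height from the smallest size at every location."""
    regions = []
    for y in range(H):
        for x in range(W):
            h = min_h
            while y + h <= H:
                w = min_w
                while x + w <= W:
                    regions.append((x, y, w, h))
                    w += step
                h += step
    return regions
```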
From the previous step we obtain 20 regions, from smallest to largest, on the feature map. The goal is to represent regions of different sizes as fixed-size features, mainly via bilinear interpolation. Writing the sampling step generically: let $\theta(\cdot)$ be the coordinate transformation function mapping a target coordinate back into the region, let $(x, y)$ be region coordinates with corresponding feature value $F(x, y)$; then the value at coordinate $(i, j)$ of the transformed feature is

$$\tilde{F}(i, j) = \sum_{(x, y)} F(x, y)\, K\big(x, y;\ \theta(i, j)\big)$$

For the sampling and kernel functions the most basic choice is adopted: map the target coordinate back to the original feature map, take the four nearest points, weight them by distance, and merge them to obtain a fixed-size feature.
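The steps above can be sketched in numpy. This is a plain bilinear resampler under the stated "four nearest points weighted by distance" scheme; the corner-aligned coordinate mapping is an assumption on my part.

```python
import numpy as np

def pool_region_bilinear(feat, region, out_size):
    """Resample a rectangular region (x, y, w, h) of feat (H, W, C) to a
    fixed (out_h, out_w, C) grid: each target coordinate is mapped back
    into the region and the four nearest points are blended by distance."""
    x0, y0, w, h = region
    out_h, out_w = out_size
    sy = (h - 1) / (out_h - 1) if out_h > 1 else 0.0
    sx = (w - 1) / (out_w - 1) if out_w > 1 else 0.0
    out = np.zeros((out_h, out_w, feat.shape[2]))
    for i in range(out_h):
        for j in range(out_w):
            y, x = y0 + i * sy, x0 + j * sx          # map back into the region
            yf, xf = int(np.floor(y)), int(np.floor(x))
            yc, xc = min(yf + 1, y0 + h - 1), min(xf + 1, x0 + w - 1)
            dy, dx = y - yf, x - xf
            out[i, j] = ((1 - dy) * (1 - dx) * feat[yf, xf]
                         + (1 - dy) * dx * feat[yf, xc]
                         + dy * (1 - dx) * feat[yc, xf]
                         + dy * dx * feat[yc, xc])
    return out
```

Resampling a region at its own size reproduces the original values, which is a quick sanity check on the coordinate mapping.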
Here the paper uses a new attention mechanism to obtain context information: each region's output is computed from its similarity to the other regions, so the model can selectively focus on more relevant regions and produce more comprehensive context. A context vector is produced from a query term and a set of key terms. A parameter matrix (with a bias term) transforms the input feature into the query, while the keys are nonlinear combinations of the features; $\alpha_{r,r'}$ denotes the similarity between the two features. The context vector of a region is then the similarity-weighted combination over the other regions, so it represents the region's context as derived from its correlation with the others. All of these parameters are learned jointly, and the overall computation is essentially the same as self-attention.
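A sketch of this region-level attention in numpy, assuming a linear query, a two-layer nonlinear key map, and a softmax-normalized similarity; the parameter names are illustrative, not the paper's notation.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def region_context_vectors(X, Wq, bq, W1, b1, W2, b2):
    """X: (R, C) pooled region features. The query is a linear map of
    x_r; the keys are a nonlinear (two-layer) map of x_{r'}; the context
    vector of region r is the similarity-weighted sum over all regions."""
    Q = X @ Wq + bq                         # queries, one per region
    K = np.tanh(X @ W1 + b1) @ W2 + b2      # nonlinear keys
    A = softmax(Q @ K.T, axis=-1)           # alpha_{r,r'}: relevance of r' to r
    return A @ X                            # context vectors c_r
```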
The context vector describes the importance and characteristics of a region. To further incorporate structural information about spatial arrangement, the paper arranges the regions' context vectors into a sequence (top to bottom, left to right) and feeds it into a recurrent neural network, using the hidden units of the recurrent network to represent the structural features.
The hidden feature of each region is produced by an LSTM with its usual learnable parameters. To improve generalization and reduce computation, the context features fed to the LSTM are obtained by global average pooling. The output is the hidden-state sequence corresponding to the context-feature sequence, which is passed to the classification module.
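The step above can be sketched with a single hand-rolled LSTM cell in numpy. This is a minimal illustration under the assumptions that each region contributes one globally average-pooled vector and that a standard LSTM cell (input, forget, cell, output gates) is used; the gate packing is my own convention.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_over_regions(ctx_maps, Wx, Wh, b, hidden):
    """ctx_maps: list of (h_r, w_r, C) context features per region,
    ordered top-to-bottom, left-to-right. Each map is reduced by global
    average pooling to a C-vector, then fed through one LSTM cell.
    Wx: (C, 4H), Wh: (H, 4H), b: (4H,) pack the i, f, g, o gates.
    Returns the (T, H) hidden-state sequence for the classifier."""
    H = hidden
    h, c = np.zeros(H), np.zeros(H)
    hs = []
    for m in ctx_maps:
        f_t = m.mean(axis=(0, 1))                     # global average pooling
        z = f_t @ Wx + h @ Wh + b
        i, f, g, o = z[:H], z[H:2*H], z[2*H:3*H], z[3*H:]
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # cell-state update
        h = sigmoid(o) * np.tanh(c)                   # hidden state
        hs.append(h)
    return np.stack(hs)
```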
To further guide the model toward distinguishing subtle changes, the paper proposes a learnable pooling operation that integrates feature information by grouping hidden states with similar responses. Following the idea of NetVLAD, a differentiable clustering method transforms the hidden responses: first the relevance between each hidden response and each cluster is computed, and the responses are then weighted into the clusters' VLAD encodings.
Each cluster has its own learnable parameters (a weight and a bias). The whole idea is based on softmax: each hidden response is distributed across the clusters according to its softmax weights. After the encoding vectors of all clusters are obtained, they are combined with learnable weights and normalized with softmax to produce the classification output. The learnable parameters of the classification module are therefore the per-cluster parameters together with these combination weights.
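The classification module can be sketched as below. This is a NetVLAD-style sketch of the idea, not the paper's exact equations: the residual against a learnable cluster center follows standard NetVLAD, and the final linear-plus-softmax classifier is an assumption.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def learnable_pool_classify(Hs, Wc, bc, centers, Wout):
    """Hs: (T, D) hidden-state sequence. Soft-assign each h_t to N
    clusters via softmax(Wc h_t + bc), accumulate NetVLAD-style residual
    encodings per cluster, then combine them into class probabilities."""
    A = softmax(Hs @ Wc + bc, axis=-1)                   # (T, N) soft assignment
    # e_k = sum_t a_k(h_t) * (h_t - c_k): per-cluster VLAD encoding
    E = np.einsum('tn,tnd->nd', A, Hs[:, None, :] - centers[None])
    scores = E.flatten() @ Wout                           # learnable combination
    return softmax(scores, axis=-1)                       # class probabilities
```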
Comparison of different methods on several datasets.
Accuracy comparison across different backbone networks.
Visualization of the output features of different modules; panel (b) shows the backbone's output features after CAP is added.
This paper proposes CAP, a fine-grained classification solution that helps the model discover subtle feature changes in the target through a context-aware attention mechanism. Beyond the pixel-level attention mechanism, it introduces a region-level attention mechanism and a local-feature encoding method, which differ considerably from previous visual schemes and are well worth studying.