Sharing a paper from CVPR 2021: ACTION-Net: Multipath Excitation for Action Recognition. The authors are from Trinity College Dublin and the ByteDance AI Lab.

The paper proposes a plug-and-play ACTION module with a mixed attention mechanism for temporal action recognition (for example, gesture recognition). The module consists of three parts: spatio-temporal excitation, channel excitation and motion excitation.

The details are as follows:

01 The three excitation blocks

Spatio-Temporal Excitation (STE): all channels are globally averaged into a single-channel feature, on which a 3D convolution with a 3×3×3 kernel produces a spatio-temporal attention map, so the map is obtained with very little extra computation. This attention map is then multiplied with the input features to obtain features excited by spatio-temporal information.
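As a concrete illustration, here is a minimal PyTorch-style sketch of such an STE block, assuming a TSM-style 2D backbone whose features are shaped (N*T, C, H, W). The layer names, the 3×3×3 kernel choice and the absence of a residual connection are assumptions of this sketch, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class STE(nn.Module):
    """Sketch of a spatio-temporal excitation block (names and kernel size are assumptions)."""
    def __init__(self):
        super().__init__()
        # single-channel 3D convolution over (T, H, W)
        self.conv3d = nn.Conv3d(1, 1, kernel_size=3, padding=1)

    def forward(self, x, n_segment):
        # x: (N*T, C, H, W), as produced by a TSM-style 2D backbone
        nt, c, h, w = x.shape
        n = nt // n_segment
        x5d = x.reshape(n, n_segment, c, h, w)
        # global average over channels -> one spatio-temporal map per video
        pooled = x5d.mean(dim=2, keepdim=True)        # (N, T, 1, H, W)
        pooled = pooled.permute(0, 2, 1, 3, 4)        # (N, 1, T, H, W) for Conv3d
        attn = torch.sigmoid(self.conv3d(pooled))     # spatio-temporal attention map
        attn = attn.permute(0, 2, 1, 3, 4)            # back to (N, T, 1, H, W)
        out = x5d * attn                              # excite the input features
        return out.reshape(nt, c, h, w)
```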

Channel Excitation (CE): this block is modelled on the SE block of SE-Net. However, because video actions carry temporal information, a 1D convolution along the temporal dimension is inserted between the channel squeeze and un-squeeze steps to strengthen the interdependence of channels over time. As in SE, a channel-wise attention map is obtained, and, as with STE, the input features are multiplied by this map to get the channel-excited features.
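For illustration, here is a minimal sketch of a CE-style block under the same (N*T, C, H, W) layout. The reduction ratio of 16, the temporal kernel size of 3 and the layer names are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class CE(nn.Module):
    """Sketch of a channel excitation block: SE-style squeeze/expand with a temporal 1D conv."""
    def __init__(self, channels, reduction=16):  # reduction ratio is an assumption
        super().__init__()
        r = max(channels // reduction, 1)
        self.squeeze = nn.Conv2d(channels, r, kernel_size=1)       # channel squeeze
        self.temporal = nn.Conv1d(r, r, kernel_size=3, padding=1)  # 1D conv along time
        self.expand = nn.Conv2d(r, channels, kernel_size=1)        # channel un-squeeze

    def forward(self, x, n_segment):
        # x: (N*T, C, H, W)
        nt, c, h, w = x.shape
        n = nt // n_segment
        s = x.mean(dim=[2, 3], keepdim=True)              # global spatial pooling -> (N*T, C, 1, 1)
        s = self.squeeze(s)                               # (N*T, C/r, 1, 1)
        cr = s.shape[1]
        s = s.reshape(n, n_segment, cr).permute(0, 2, 1)  # (N, C/r, T)
        s = self.temporal(s)                              # model channel dependence over time
        s = s.permute(0, 2, 1).reshape(nt, cr, 1, 1)
        attn = torch.sigmoid(self.expand(s))              # channel-wise attention map
        return x * attn
```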

Motion Excitation (ME): ME has appeared in previous works such as STM and TEA. It mainly models the motion between every two adjacent frames, which is very similar in spirit to optical flow. The ME structure from those works is taken as a branch and combined with the two blocks above to form the ACTION module.
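For completeness, here is a minimal sketch of an ME-style branch in the spirit of STM/TEA, again assuming (N*T, C, H, W) inputs. The reduction ratio, the channel-wise 3×3 transform and the zero-padding of the last frame are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class ME(nn.Module):
    """Sketch of a motion excitation branch in the spirit of STM/TEA."""
    def __init__(self, channels, reduction=16):  # reduction ratio is an assumption
        super().__init__()
        r = max(channels // reduction, 1)
        self.squeeze = nn.Conv2d(channels, r, kernel_size=1)
        self.transform = nn.Conv2d(r, r, kernel_size=3, padding=1, groups=r)  # channel-wise conv
        self.expand = nn.Conv2d(r, channels, kernel_size=1)

    def forward(self, x, n_segment):
        # x: (N*T, C, H, W)
        nt, c, h, w = x.shape
        n = nt // n_segment
        s = self.squeeze(x)                                   # (N*T, C/r, H, W)
        cr = s.shape[1]
        s5d = s.reshape(n, n_segment, cr, h, w)
        # motion feature: transformed frame t+1 minus frame t (similar in spirit to optical flow)
        nxt = self.transform(s5d[:, 1:].reshape(-1, cr, h, w)).reshape(n, n_segment - 1, cr, h, w)
        diff = nxt - s5d[:, :-1]                              # (N, T-1, C/r, H, W)
        diff = torch.cat([diff, torch.zeros_like(s5d[:, :1])], dim=1)  # pad the last frame
        m = diff.reshape(nt, cr, h, w).mean(dim=[2, 3], keepdim=True)  # spatial pooling
        attn = torch.sigmoid(self.expand(m))                  # motion attention map
        return x * attn
```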

02 The ACTION module

The ACTION module places the above three excitation blocks in parallel. Like TSM before it, the module is plug-and-play. For the comparison with state-of-the-art methods, the backbone is the same ResNet-50 used in earlier work. In addition, with TSN and TSM as baselines, the performance of ACTION is tested on different backbones (ResNet-50, MobileNet V2, BN-Inception).
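A minimal sketch of how the three branches from the sketches above could be wired in parallel as a drop-in block; summing the three excited features and using 8 segments are assumptions here, and in practice the block would be inserted into the residual blocks of the 2D backbone.

```python
import torch.nn as nn

class ACTION(nn.Module):
    """Sketch of the combined module: STE, CE and ME (from the sketches above) in parallel."""
    def __init__(self, channels, n_segment=8):  # 8 segments is an assumption
        super().__init__()
        self.n_segment = n_segment
        self.ste = STE()
        self.ce = CE(channels)
        self.me = ME(channels)

    def forward(self, x):
        # x: (N*T, C, H, W); summing the three excited features is an assumption of this sketch
        return (self.ste(x, self.n_segment)
                + self.ce(x, self.n_segment)
                + self.me(x, self.n_segment))
```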

03 Experiments

In the experiments, three video datasets, Something-Something V2, Jester and EgoGesture, are used to evaluate the proposed ACTION module.

3.1 Comparison with the state of the art

As can be seen from the comparison table, ACTION performs very well on Jester and EgoGesture, achieving state-of-the-art results on both. On Something-Something V2, its results are very close to those of STM and TEA.

It is worth noting, however, that STM and TEA are designed for ResNet and Res2Net respectively, whereas ACTION is a plug-and-play module and is not tied to a particular type of backbone. The authors state that results on MobileNet V2 and BN-Inception will be shown later.

3.2 Ablation study

An efficiency factor is defined to quantify the extra computation the ACTION module needs, relative to TSM, for each additional 1% of top-1 accuracy; a lower value means higher efficiency. The figure below shows this efficiency for the three datasets on the three backbones. The efficiency gain of ACTION is most obvious on MobileNet V2, and, among the datasets, ACTION is most efficient on Something-Something V2.
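A tiny, hypothetical helper that illustrates how such an efficiency factor could be computed; the function name and the units (GFLOPs and percentage points of top-1 accuracy) are assumptions, and the example numbers are made up purely for illustration.

```python
def efficiency_factor(flops_model, flops_tsm, top1_model, top1_tsm):
    """Extra computation per percentage point of top-1 accuracy gained over TSM.
    Lower is better. Units (GFLOPs, percentage points) are assumptions of this sketch."""
    return (flops_model - flops_tsm) / (top1_model - top1_tsm)

# Made-up example: 2 extra GFLOPs for a 1.5-point top-1 gain -> factor ~1.33
print(efficiency_factor(35.0, 33.0, 62.5, 61.0))
```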