Knowledge distillation can be regarded as a teacher network transferring knowledge to a student network by providing soft labels, which makes it a more informative form of label smoothing. Compared with hard labels, soft labels carry inter-class similarity information and act as a regularizer, which eases the student's optimization.
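For reference, the standard soft-label distillation loss (from Hinton et al.) combines cross-entropy on the hard labels with a temperature-softened KL term; a minimal statement of the usual form:

$$
\mathcal{L}_{KD} = (1-\lambda)\,\mathrm{CE}\big(y,\ \sigma(z_s)\big) + \lambda\, T^{2}\, \mathrm{KL}\big(\sigma(z_t/T)\,\|\,\sigma(z_s/T)\big)
$$

where $z_t$ and $z_s$ are the teacher and student logits, $T$ is the temperature, and $\lambda$ balances the two terms.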
So what role does knowledge distillation play in neural architecture search? Summarized as follows:
Knowledge distillation is used as a training technique in many works. For example, OFA uses a progressive shrinking training strategy in which the largest network guides the learning of the smaller networks via in-place distillation; BigNAS uses the sandwich rule, where the largest network guides the distillation of the remaining sampled networks. A sketch of this idea is shown below.
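A minimal PyTorch-style sketch of the sandwich-rule / in-place distillation loop (the supernet and its `sample_max` / `sample_min` / `sample_random` helpers are illustrative placeholders, not the actual OFA or BigNAS API):

```python
import torch
import torch.nn.functional as F

def sandwich_step(supernet, images, labels, optimizer, num_random=2):
    """One training step following the sandwich rule with in-place distillation."""
    optimizer.zero_grad()

    # 1) Largest subnet: trained with the ground-truth hard labels.
    supernet.sample_max()                      # placeholder: activate the biggest subnet
    max_logits = supernet(images)
    F.cross_entropy(max_logits, labels).backward()

    # Soft labels from the largest subnet supervise the smaller ones.
    soft_targets = F.softmax(max_logits.detach(), dim=1)

    # 2) Smallest + randomly sampled subnets: trained with the soft labels only.
    samplers = [supernet.sample_min] + [supernet.sample_random] * num_random
    for sample in samplers:
        sample()                               # placeholder: activate one smaller subnet
        logits = supernet(images)
        kd_loss = F.kl_div(F.log_softmax(logits, dim=1), soft_targets,
                           reduction="batchmean")
        kd_loss.backward()

    optimizer.step()
```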
Objective: to solve the matching problem between the teacher network and the student network.
In knowledge distillation, the final performance of the student network varies greatly depending on which teacher network and which student network are chosen. If the capacity gap between the student and the teacher is too large, the student struggles to learn. Cream addresses the problem of matching the two.
As shown on the left, the commonly used SPOS approach trains the supernet by sampling single-path subnets. On the right, combining this with knowledge distillation, Cream proposes two modules:
The central idea of Cream is that subnets can learn collaboratively and teach each other throughout training, with the aim of improving the convergence of the individual models.
The ablation experiment is as follows:
Objective: the teacher guides the learning of each feature layer, and the performance of each subnet is judged according to the distillation loss.
This is a deep integration of NAS and KD, accepted at CVPR 2020. I have written a separate article explaining it, so here is only a brief review.
DNA is a two-stage NAS method, and distillation is introduced to replace the usual accuracy metric: the closeness between a subnet and the teacher network is used as the measure of subnet performance.
During training, block-wise distillation is performed: the input of a student block comes from the output of the preceding teacher block, and the output of that student block is forced to match the output of the corresponding teacher block (using an MSE loss). During the search stage, subnets are ranked by computing their closeness to the teacher network.
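A rough sketch of this block-wise supervision, assuming the teacher and student are available as aligned lists of blocks with matching feature shapes (names are illustrative, not DNA's actual code):

```python
import torch
import torch.nn.functional as F

def blockwise_distill_step(teacher_blocks, student_blocks, x, optimizer):
    """Block-wise distillation: each student block takes the teacher's previous
    feature map as input and must reproduce the teacher's next feature map (MSE)."""
    optimizer.zero_grad()
    total_loss = 0.0
    feat = x
    for t_block, s_block in zip(teacher_blocks, student_blocks):
        with torch.no_grad():
            teacher_out = t_block(feat)        # target feature map for this block
        student_out = s_block(feat)            # student is fed the teacher's input feature
        total_loss = total_loss + F.mse_loss(student_out, teacher_out)
        feat = teacher_out                     # next block again starts from teacher features
    total_loss.backward()
    optimizer.step()
    return total_loss.item()
```

At search time the same per-block MSE (lower means closer to the teacher) can be used to rank candidate subnets.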
Objective: to prevent the student from over- or under-estimating the teacher network by improving on the KL divergence.
The figure above shows the distillation method commonly used in OFA, BigNAS and other search algorithms, where the subnet is supervised with a KL divergence. This paper analyzes the limitations of the KL divergence: zero-avoiding and zero-forcing behavior. In the following formula, P is the teacher's output distribution and Q is the student's output distribution.
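The formula referred to here is the standard KL divergence between the teacher distribution P and the student distribution Q:

$$
\mathrm{KL}(P\,\|\,Q) = \sum_i p_i \log \frac{p_i}{q_i}
$$

Because each term is weighted by $p_i$, positions where $p_i \approx 0$ contribute almost nothing, while positions with $p_i > 0$ and $q_i \to 0$ make the loss blow up.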
AlphaNet proposes a new divergence loss function to prevent over- or under-estimation: as shown below, it introduces an α-divergence, defined for values of α other than 0 and 1. Its behavior is shown in the following figure:
The blue line corresponds to case 2: when α is negative, if q overestimates the uncertainty in p, the value of the divergence becomes larger.
The purple line corresponds to case 1: when α is positive, if q underestimates the uncertainty in p, the value of the divergence becomes larger.
Both cases are considered at the same time, and the maximum of the two is taken as the divergence:
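In symbols, with a negative $\alpha_-$ and a positive $\alpha_+$, the loss takes the larger of the two α-divergences (my paraphrase of the paper's formulation; the clipping details are omitted):

$$
\mathcal{D}_{\alpha_-,\alpha_+}(p\,\|\,q) = \max\big\{ D_{\alpha_-}(p\,\|\,q),\ D_{\alpha_+}(p\,\|\,q) \big\}
$$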
Objective: to propose a metric that measures the similarity of internal activations between the student network and the teacher network, and to speed up architecture search through representation matching.
This approach belongs to the relation-based knowledge category in the knowledge distillation taxonomy: the constructed knowledge comes from the interactions between different samples.
As shown in the figure above, the metric is built from a bs x bs matrix (bs being the batch size), called the representational dissimilarity matrix (RDM) in this paper. It captures the internal representation of an activation layer, and the similarity between two RDMs can be evaluated from the correlation of their upper-triangular entries, for example with the Pearson coefficient.
The paper also constructs a P+TG score to measure subnet performance and select the best subnet.
As shown in the figure above, the RDM computation measures the similarity between teacher and student features, and the pairing with the highest similarity is selected. With this set of metrics, ranking consistency improves quickly as training epochs progress.
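A minimal sketch of how such an RDM and its similarity score could be computed with Pearson correlations (the pooling/flattening of feature maps into per-sample vectors is an assumption here, not necessarily the paper's exact preprocessing):

```python
import torch

def rdm(features):
    """features: (bs, d) per-sample activations -> (bs, bs) dissimilarity matrix."""
    f = features - features.mean(dim=1, keepdim=True)
    f = f / (f.norm(dim=1, keepdim=True) + 1e-8)
    return 1.0 - f @ f.t()                      # 1 - Pearson correlation between samples

def rdm_similarity(teacher_feat, student_feat):
    """Pearson correlation between the upper-triangular entries of the two RDMs."""
    bs = teacher_feat.shape[0]
    iu = torch.triu_indices(bs, bs, offset=1)
    t = rdm(teacher_feat)[iu[0], iu[1]]
    s = rdm(student_feat)[iu[0], iu[1]]
    t, s = t - t.mean(), s - s.mean()
    return (t * s).sum() / (t.norm() * s.norm() + 1e-8)
```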
Objective: given a fixed teacher network, find the most suitable student network.
For the same teacher network, student networks with different structures generalize differently even when they have the same FLOPs or number of parameters. In this work, a fixed teacher network is chosen and the best student network is found by architecture search: on top of an L1-norm-based optimization, the student network with the smallest KL divergence from the teacher network is selected.
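A sketch of that selection criterion, assuming candidate students are scored on a held-out loader (the candidate list and loader are illustrative; the L1-norm optimization step is not shown):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def kl_to_teacher(teacher, student, loader, device="cpu"):
    """Average KL(teacher || student) over a validation loader; lower is better."""
    total, n = 0.0, 0
    for images, _ in loader:
        images = images.to(device)
        p = F.softmax(teacher(images), dim=1)             # teacher probabilities
        log_q = F.log_softmax(student(images), dim=1)     # student log-probabilities
        total += F.kl_div(log_q, p, reduction="batchmean").item() * images.size(0)
        n += images.size(0)
    return total / n

# best_student = min(candidates, key=lambda s: kl_to_teacher(teacher, s, val_loader))
```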
Objective: to find the most suitable student network for a given teacher network.
The knowledge in a neural network is contained not only in its parameters but is also affected by its architecture. Standard KD distills the teacher network's knowledge into a fixed student network. This paper proposes Architecture-aware Knowledge Distillation (AKD), which searches for the student network that is best suited to distillation from a specific teacher model.
Motivation: a first set of experiments showed that different teacher networks prefer different student networks. Therefore, in NAS, using different teacher networks leads the search to choose different network structures.
AKD uses reinforcement learning to guide the search process, sampling architectures with an RNN controller in the style of ENAS.
Objective: to adjust the capacity of the student network with NAS; a combination of NAS + KD + ensembling.
As mentioned before, this paper is a mix of architecture search, knowledge distillation, and model ensembling.
See /DD_PP_jj/article/details/121268840 for details.
This paper is quite interesting: the subnets obtained in the previous step are ensembled to form the teacher network, which then guides the learning of new subnets through knowledge distillation. Key points:
Inspired by Born-Again Networks (BAN), AdaNAS proposes an Adaptive Knowledge Distillation (AKD) method to assist the training of the subnets.
Ensemble model selection:
From left to right are four iterations; in each iteration, three models are sampled from the search space. The model in the green box is the best model of that iteration, and AdaNAS selects the best subnet from each iteration as a member of the ensemble.
Additional weight parameters w1-w4 are added for the final ensembling:
The final output logits are computed as follows (the weights w are also trained; at this stage the weights of each ensembled network are frozen and only w is optimized):
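In symbols, with subnet logits $z_1,\dots,z_4$, the ensembled output is a weighted sum (my reading of the figure; only the $w_i$ are learned at this stage):

$$
z_{\text{ensemble}} = \sum_{i=1}^{4} w_i \, z_i
$$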
Knowledge distillation
Objective: to improve the efficiency and effectiveness of knowledge distillation by using feature aggregation to guide the learning between the teacher network and the student network. Architecture search enters in the feature-aggregation step, where the scaling coefficients are adapted with a DARTS-style method. (ECCV 2020)
This paper summarizes several distillation examples:
The last one is the method proposed in this paper. Common feature distillation only matches the last feature map of each block between teacher and student; this paper argues that the entire teacher block can be used to guide the student network.
Specifically, to aggregate all the feature maps within a teacher block, the paper uses a DARTS-style method to aggregate the information dynamically. (a) shows the differentiable search process within a group; (b) shows the construction of the teacher-to-student path loss using CE loss; (c) shows the construction of the student-to-teacher path loss using an L2 loss. The connector is simply a 1x1 convolution layer.
(ps: connector is reminiscent of VID)
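A rough sketch of the aggregation idea, assuming the teacher block exposes its intermediate feature maps with matching spatial sizes (the softmax-weighted sum over DARTS-style coefficients and the 1x1 connector are the key parts; the class and argument names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AggregatedFeatureKD(nn.Module):
    """Aggregate all feature maps of a teacher block with learnable (DARTS-style)
    coefficients, project the student feature with a 1x1 connector, and match them."""
    def __init__(self, num_teacher_feats, student_channels, teacher_channels):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(num_teacher_feats))   # scaling coefficients
        self.connector = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, teacher_feats, student_feat):
        # teacher_feats: list of (N, C_t, H, W) maps from one teacher block
        weights = F.softmax(self.alpha, dim=0)
        aggregated = sum(w * f for w, f in zip(weights, teacher_feats))
        projected = self.connector(student_feat)                     # 1x1 conv connector
        return F.mse_loss(projected, aggregated)                     # L2 distillation loss
```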