At present, neural architecture search (NAS) has achieved impressive results in designing model structures for image classification, but it is very time-consuming, with most of the time spent training the sampled child models. The main contribution of this paper is Efficient Neural Architecture Search (ENAS), which forces all child models to share weights so that none of them has to be trained from scratch, thereby improving efficiency. Although different models would normally use different weights, results from transfer learning and multi-task learning show that parameters learned by model A on one task can usefully be applied to model B on another task. Experimentally, sharing parameters is not only feasible but also delivers strong performance. The experiments use only a single GTX 1080Ti GPU and are roughly 1000 times faster than NAS.
A NAS search result can be viewed as a subgraph of a larger graph, and the search space can be represented by a directed acyclic graph (DAG). Each searched architecture is a subgraph of the DAG in Figure 2. The DAG defined by ENAS is the superposition of all subnetworks: each node keeps its own parameters for every computation type, and those parameters are used only when the corresponding computation is activated. The design of ENAS therefore lets the subnetworks share parameters, which is described in detail below.
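As a concrete (and deliberately simplified) illustration of this parameter-sharing idea, here is a minimal sketch of my own, not the authors' code: every node owns one weight matrix per candidate operation and per incoming edge, and a sampled sub-model only touches the weights of whichever edge and operation it activates.

```python
import numpy as np

OPS = ["tanh", "relu", "sigmoid", "identity"]   # candidate computation types

class SharedNode:
    def __init__(self, dim, num_prev):
        # one weight matrix per (previous-node, operation) pair; every sub-model reuses these
        self.weights = {
            (j, op): np.random.randn(dim, dim) * 0.01
            for j in range(num_prev) for op in OPS
        }

    def forward(self, prev_outputs, j, op):
        # only the parameters of the activated edge and operation are touched
        h = prev_outputs[j] @ self.weights[(j, op)]
        if op == "tanh":
            return np.tanh(h)
        if op == "relu":
            return np.maximum(h, 0.0)
        if op == "sigmoid":
            return 1.0 / (1.0 + np.exp(-h))
        return h  # identity
```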
To design recurrent cells, a DAG over N nodes is used, where nodes represent computation types and edges represent the flow of information. The ENAS controller, also an RNN, mainly decides: 1) which edges are activated, and 2) the computation type at each node. In NAS (Zoph 2017), the search space for the recurrent cell is a topology with a predefined structure (a binary tree), and only the computation type at each node is learned, whereas ENAS learns both the topology and the computation types, which is more flexible.
To create a recurrent cell, the controller RNN samples its decisions block by block. The cell receives the input of the current time step (e.g., a word embedding) and the hidden-layer output of the previous time step. For each node, the steps are: 1) select an earlier node whose output feeds the current node, and 2) select an activation function to apply; the outputs of nodes that are never chosen as inputs to later nodes are averaged to form the cell output.
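A minimal sketch of these per-node decisions, assuming the structure just described (my own illustration, not the released ENAS code):

```python
import random

ACTIVATIONS = ["tanh", "relu", "sigmoid", "identity"]  # the four candidate activations

def sample_recurrent_cell(num_nodes, rng=random.Random(0)):
    """Sample the controller's per-node decisions for one recurrent cell."""
    cell = []
    for node in range(num_nodes):
        # node 0 always reads the cell inputs (x_t and h_{t-1});
        # every later node reads the output of one earlier node
        prev = None if node == 0 else rng.randrange(node)
        act = rng.choice(ACTIVATIONS)        # which activation to apply at this node
        cell.append({"node": node, "prev": prev, "activation": act})
    return cell

# Nodes that no later node reads ("loose ends") are averaged to form the cell output.
print(sample_recurrent_cell(num_nodes=4))
```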
Note that each pair of nodes (j, ℓ) has its own independent parameters, and which parameters are used depends on the indices that were selected; as a result, all recurrent cells in ENAS can share the same set of parameters. The search space contains an exponential number of configurations: with N nodes and four activation functions, there are 4^N × N! possible configurations in total.
The ENAS controller is an LSTM with 100 hidden units that makes its choices autoregressively through a softmax classifier: the embedding of the previous step's choice is fed as the input to the next step, and the first step receives an empty embedding. The learnable parameters are the controller LSTM parameters and the shared weights of the subnetworks. ENAS training alternates between two interleaved phases: the first phase trains the shared weights on a full pass over the training set, and the second phase trains the parameters of the controller LSTM.
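To make the autoregressive sampling concrete, here is a hedged PyTorch sketch of such a controller (mine, not the authors' code; the real controller uses separate softmax heads for the different decision types, which is omitted here): an LSTM with 100 hidden units whose previous choice, once embedded, becomes the next input.

```python
import torch
import torch.nn as nn

class Controller(nn.Module):
    """Autoregressive controller: one decision per LSTM step via a softmax."""

    def __init__(self, num_choices, hidden=100):
        super().__init__()
        # the extra embedding index serves as the "empty" input for the first step
        self.embed = nn.Embedding(num_choices + 1, hidden)
        self.lstm = nn.LSTMCell(hidden, hidden)
        self.logits = nn.Linear(hidden, num_choices)
        self.num_choices, self.hidden = num_choices, hidden

    def sample(self, num_steps):
        h = torch.zeros(1, self.hidden)
        c = torch.zeros(1, self.hidden)
        inp = self.embed(torch.tensor([self.num_choices]))  # empty embedding at step 1
        decisions, log_probs = [], []
        for _ in range(num_steps):
            h, c = self.lstm(inp, (h, c))
            dist = torch.distributions.Categorical(logits=self.logits(h))
            choice = dist.sample()
            decisions.append(choice.item())
            log_probs.append(dist.log_prob(choice))
            inp = self.embed(choice)  # feed the previous choice back as the next input
        return decisions, torch.stack(log_probs).sum()
```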
In the first phase, the controller policy is held fixed and stochastic gradient descent is applied to the shared weights to minimize the expected cross-entropy loss, where the loss on each minibatch is computed with a model sampled from the policy.
The gradient is estimated with a Monte Carlo sample: M models m_1, ..., m_M are drawn from the policy and their gradients are averaged, i.e. ∇_ω E_{m∼π(m;θ)}[L(m; ω)] ≈ (1/M) Σ_i ∇_ω L(m_i, ω). This estimate is unbiased but has high variance (as in NAS, different sampled models perform very differently); the paper finds that training still works well with M = 1.
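A sketch of one step of this phase under stated assumptions: `cross_entropy_loss` is a hypothetical helper that runs the sampled architecture with the shared weights on a minibatch, and the number of controller decisions (24) is arbitrary.

```python
def update_shared_weights(controller, shared_model, batch, omega_optimizer, M=1):
    """One SGD step on the shared weights omega, averaging over M sampled models."""
    omega_optimizer.zero_grad()
    for _ in range(M):
        arch, _ = controller.sample(num_steps=24)           # m ~ pi(m; theta), theta held fixed
        loss = cross_entropy_loss(shared_model, arch, batch) / M
        loss.backward()                                     # accumulates (1/M) * grad_omega L(m_i, omega)
    omega_optimizer.step()
```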
In the second phase, the policy parameters are updated with the goal of maximizing the expected reward, using the Adam optimizer; the gradient is computed with Williams' REINFORCE, and an exponential moving average is used as a baseline to reduce variance. The reward is computed on a held-out validation set, essentially the same setup as Zoph's NAS.
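A hedged sketch of that controller update, reusing the controller above; `validation_reward` is a placeholder for the reward (e.g., accuracy) computed on held-out validation data with the shared weights.

```python
def update_controller(controller, shared_model, valid_batch, theta_optimizer,
                      baseline=None, baseline_decay=0.99):
    """One REINFORCE step on the controller parameters theta; returns the updated baseline."""
    arch, log_prob = controller.sample(num_steps=24)              # sample an architecture m
    reward = validation_reward(shared_model, arch, valid_batch)   # R(m, omega), omega held fixed

    # exponential moving average of past rewards, used as a variance-reducing baseline
    if baseline is None:
        baseline = reward
    baseline = baseline_decay * baseline + (1 - baseline_decay) * reward

    loss = -(reward - baseline) * log_prob   # REINFORCE: maximize (R - b) * log pi(m)
    theta_optimizer.zero_grad()
    loss.backward()
    theta_optimizer.step()                   # Adam step on theta
    return baseline
```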
To build the final model from a trained ENAS, several architectures are first sampled from the trained policy. For each sampled model, the accuracy on a single minibatch of validation data is computed, and the model with the highest accuracy is retrained from scratch. All sampled networks could instead be trained from scratch, but the method in this paper reaches similar accuracy at far lower cost.
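A sketch of this derivation step, with the same hypothetical `validation_reward` helper; the number of candidates sampled here (10) is illustrative.

```python
def derive_final_architecture(controller, shared_model, valid_batch, num_samples=10):
    """Sample candidates and keep the one scoring best on a single validation minibatch."""
    best_arch, best_reward = None, float("-inf")
    for _ in range(num_samples):
        arch, _ = controller.sample(num_steps=24)
        reward = validation_reward(shared_model, arch, valid_batch)  # scored with shared weights
        if reward > best_reward:
            best_arch, best_reward = arch, reward
    return best_arch   # this architecture is then retrained from scratch
```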
To create a convolutional network (macro search), each decision block of the controller makes two decisions that together define one layer of the network: 1) which previous layers to connect to, creating skip connections, and 2) which operation to use for this layer.
Making these two decisions for each of the L layers generates the whole network; in total the search space contains 6^L × 2^(L(L−1)/2) possible networks. In the experiments, L = 12.
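An illustrative sketch of macro sampling (the operation names are my shorthand for the six candidate operations; this is not the paper's code):

```python
import random

# six candidate operations for the macro (whole-network) search space
MACRO_OPS = ["conv3x3", "conv5x5", "sepconv3x3", "sepconv5x5", "avgpool3x3", "maxpool3x3"]

def sample_macro_network(num_layers=12, rng=random.Random(0)):
    layers = []
    for i in range(num_layers):
        op = rng.choice(MACRO_OPS)                               # decision 1: this layer's operation
        skips = [j for j in range(i) if rng.random() < 0.5]      # decision 2: earlier layers to connect to
        layers.append({"layer": i, "op": op, "skip_from": skips})
    return layers

# 6 operation choices per layer and an independent yes/no for every possible skip
# connection give 6**L * 2**(L * (L - 1) // 2) networks for an L-layer search.
```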
NASNet proposed designing small modules and then stacking them into a complete network; the modules to design are mainly the convolution cell and the reduction cell.
To generate a convolution cell, ENAS builds a DAG with B nodes to represent the computation inside the cell. Nodes 1 and 2 are the cell's inputs, namely the outputs of the previous two cells in the full network. For each of the remaining nodes, the controller predicts two choices: 1) two previous nodes to use as inputs to the current node, and 2) the operation to apply to each of the two inputs, drawn from five operators: identity, 3×3 and 5×5 separable convolutions, 3×3 average pooling, and 3×3 max pooling. The search proceeds node by node, as sketched after the next paragraph.
A reduction cell can be generated from the same search space in the same way: 1) sample a computational graph as in Figure 5, and 2) apply every operation with a stride of 2, so that the reduction cell halves the spatial size of its input. The controller therefore predicts 2(B − 2) blocks in total.
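A rough sketch of this cell sampling (my own illustration; B = 7 nodes is only an example): nodes 0 and 1 are the two cell inputs, every later node picks two earlier nodes and an operation for each, and a reduction cell reuses the same sampler with all operations applied at stride 2.

```python
import random

# five candidate operations for the micro (cell) search space
CELL_OPS = ["identity", "sepconv3x3", "sepconv5x5", "avgpool3x3", "maxpool3x3"]

def sample_cell(num_nodes=7, rng=random.Random(0)):
    """Sample one cell: nodes 0 and 1 are the two cell inputs, the rest are decided."""
    blocks = []
    for i in range(2, num_nodes):
        in1, in2 = rng.randrange(i), rng.randrange(i)              # choose two previous nodes
        op1, op2 = rng.choice(CELL_OPS), rng.choice(CELL_OPS)      # choose an operation per input
        blocks.append({"node": i, "inputs": (in1, in2), "ops": (op1, op2)})
    return blocks

conv_cell = sample_cell()       # normal cell: operations applied with stride 1
reduction_cell = sample_cell()  # reduction cell: the same decisions, applied with stride 2
```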
Finally, the complexity of this search space can be estimated. For node i, the controller first selects two previous nodes and then two of the five operators, which yields an exponential number of possible cells in total. Since the convolution cell and the reduction cell are sampled independently, the final size of the search space is the product of the possibilities for the two cells.
Some modifications are made to the node computations; for example, highway connections are added between nodes, which introduces elementwise multiplications. The search result is shown in Figure 6 and has several interesting properties: 1) all chosen activation functions are either tanh or ReLU; 2) the structure may be a local optimum, since randomly replacing a node's activation function causes a significant performance drop; 3) the cell output is the average of six nodes, similar to a mixture of contexts (MoC).
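The exact gating form is not spelled out above, so the snippet below is only a guess at what "highway connections with elementwise multiplication" means: a learned elementwise gate blends the transformed and untransformed inputs of a node.

```python
import torch

def highway_node(h_prev, W, W_gate, activation=torch.tanh):
    """Blend the transformed and untransformed inputs with a learned elementwise gate."""
    c = torch.sigmoid(h_prev @ W_gate)                    # gate in (0, 1), elementwise
    return c * activation(h_prev @ W) + (1 - c) * h_prev  # elementwise multiplications
```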
Training took about 10 hours on a single GTX 1080Ti. Results on Penn Treebank are shown in Table 1; lower perplexity (PPL) is better. ENAS reaches a low perplexity with few parameters.
The first block in Table 2 is the DenseNet baseline, the second block is the result of ENAS designing the whole convolutional network (I think the micro search space should not be included here), and the third block is the result of designing cells.
The best architecture found by whole-network (macro) search is shown in Figure 7, with an error rate of 4.23%, better than NAS. The single-GPU search takes about 7 hours, roughly 50,000 times faster than NAS.
The architecture found by cell (micro) search is shown in Figure 8. The single-GPU search takes 11.5 hours and reaches an error rate of 3.54% with cutout augmentation, competitive with NASNet. Again, the searched structure appears to be a local optimum, since modifying it degrades performance; and unlike NAS, ENAS does not sample and train many candidate networks from scratch, a step that contributes substantially to NAS's performance.
NAS is an important method for automatically designing network architectures, but its enormous resource requirements have prevented wide adoption. The Efficient Neural Architecture Search (ENAS) proposed in this paper shares parameters among subnetworks, making it more than 1000 times faster than NAS: a single-GPU search takes less than half a day with no loss in performance, which makes it well worth studying.