On pruning research
This article surveys the current state of pruning research through several papers and blog posts, and discusses the practical effects of pruning based on some of my own experiments.

The theoretical basis of pruning is over-parameterization. In traditional machine learning, over-parameterization means over-fitting.

In deep learning, however, over-parameterization is indispensable.

(The following content is drawn from the blog post "Some intuition about over-parameterization", /p/40516287.)

In deep learning, it is necessary to start training with a large, over-parameterized model, because such a model has strong representation and optimization capabilities. Once we reach the inference stage, we no longer need that many parameters. This assumption is what allows us to simplify the model before deployment; the pruning and quantization methods in model compression are both built on this premise.

(The following content is adapted from a blog series of casual notes on model compression, specifically its installment on network pruning.)

The core problem of pruning is how to cut the model effectively while minimizing the loss of accuracy.

Actually, this is not a new problem. Pruning of neural networks was already studied in the late 1980s and early 1990s. For example, the paper "Comparing Biases for Minimal Network Construction with Back-Propagation" proposes a magnitude-based pruning method: the number of hidden units is minimized by applying to each hidden unit a weight-decay term related to its magnitude. The classic early-1990s papers "Optimal Brain Damage" and "Second Order Derivatives for Network Pruning: Optimal Brain Surgeon" proposed the OBD and OBS methods respectively; they measure the importance of each weight by the second derivative of the loss function with respect to the weights (the Hessian over the weight vector) and prune accordingly. But in the climate of that time, neural networks (there were no deep neural networks yet, only neural networks, or "shallow" neural networks if one wants to make the distinction) were not a particularly mainstream branch of machine learning, so this line of work did not flourish for a long time. Still, the way these papers framed the problem and approached it had a far-reaching influence on much of the work that came more than twenty years later.

By 2012, as everyone knows, deep learning had risen to fame and was in full bloom.

After that, the race to top the leaderboards intensified, and attention focused on pushing accuracy higher. The general trend was to make networks deeper and heavier to improve accuracy, and the state of the art on ImageNet reached a new high every year.

During 2015-2016, Song Han and collaborators published a series of works on deep neural network model compression, such as "Learning both Weights and Connections for Efficient Neural Networks" and "EIE: Efficient Inference Engine on Compressed Deep Neural Network".

Among them, "Deep Compression: Using Pruning, Training Quantization and huffman encoding to Compress Deep Neural Networks" won the best paper ICLR 20 16. Among them, the classic networks AlexNet and VGG are compressed. Combining pruning, quantization, huffman encoding and other methods, the network scale is compressed several times and the performance is improved several times. Among them, for the precision loss caused by pruning, the iterative pruning method is used to compensate, which can make the precision almost no loss. This makes everyone realize that the redundancy of DNN parameters is so great that so much oil can be produced. In the following years, the field of model compression became more abundant, and more and more related work gave birth to various gameplay.

Depending on whether the structure left after pruning is still regular, pruning can be divided into structured pruning and unstructured pruning.

In terms of pruning granularity, it can be divided into fine-grained (individual-weight) pruning, vector-level pruning, kernel-level pruning, and filter/channel-level pruning.

Pruning essentially removes unimportant parameters, so the key question is how to measure the importance of a parameter. The classification of pruning methods below mainly follows that same blog series on model compression and network pruning.

One of the simplest heuristics is to evaluate importance by the absolute value of a parameter (or of a feature output) and then greedily remove the least important part; this is called magnitude-based weight pruning.
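As a toy illustration (not taken from any of the papers above), a minimal PyTorch-style sketch of magnitude pruning on a single weight tensor might look like this; the `sparsity` argument and the mask-based formulation are my own assumptions:

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Return a 0/1 mask that zeroes out the `sparsity` fraction of smallest-magnitude entries."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return torch.ones_like(weight)
    threshold = weight.abs().flatten().kthvalue(k).values  # k-th smallest magnitude
    return (weight.abs() > threshold).float()

# Example: prune 70% of a random 64x128 weight matrix
w = torch.randn(64, 128)
w_pruned = w * magnitude_prune(w, 0.7)
```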

In this setting we want the parameters to be sparse, so a regularization term is added to the training loss, typically L1 regularization, which pushes the weights toward sparsity. For structured pruning we want structured sparse weights, so group lasso is often used instead to obtain structured sparsity.
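A minimal sketch of adding such an L1 penalty to the training loss, assuming `model`, `criterion`, `inputs`, and `targets` come from an existing PyTorch training loop and `l1_lambda` is a hypothetical strength hyperparameter:

```python
import torch

l1_lambda = 1e-4  # hypothetical regularization strength

def loss_with_l1(model, criterion, inputs, targets):
    # Task loss plus an L1 penalty on all weight matrices/kernels (biases excluded).
    task_loss = criterion(model(inputs), targets)
    l1_penalty = sum(p.abs().sum() for p in model.parameters() if p.dim() > 1)
    return task_loss + l1_lambda * l1_penalty
```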

The sparsity can be imposed on different objects. It can act on the channels of the convolution layers, as in "Learning Structured Sparsity in Deep Neural Networks"; on the parameters of the BN layers, which can be trained to be sparse, as in the 2017 paper "Learning Efficient Convolutional Networks through Network Slimming"; or on the activation outputs: activation functions such as ReLU often produce sparse activations, which can be used to remove the corresponding channels of the preceding layer, as in the 2016 paper "Network Trimming: A Data-Driven Neuron Pruning Approach towards Efficient Deep Architectures".
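A sketch in the spirit of Network Slimming (my own simplified formulation, not the authors' code): penalize the BatchNorm scale factors with L1 during training, then select the channels with the smallest scales for removal.

```python
import torch.nn as nn

def bn_l1_penalty(model: nn.Module):
    # Sum of |gamma| over all BatchNorm2d layers; m.weight holds the per-channel gamma.
    # Add `slim_lambda * bn_l1_penalty(model)` to the training loss.
    return sum(m.weight.abs().sum()
               for m in model.modules() if isinstance(m, nn.BatchNorm2d))

def channels_to_prune(bn: nn.BatchNorm2d, prune_fraction: float) -> list:
    # After sparse training, channels with the smallest |gamma| are candidates for removal.
    gammas = bn.weight.detach().abs()
    k = int(gammas.numel() * prune_fraction)
    return gammas.argsort()[:k].tolist()
```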

This family of methods assumes that the smaller the absolute value of a parameter, the smaller its influence on the final result; call it the "smaller norm, less important" criterion. However, this assumption does not necessarily hold (it is discussed, for example, in the 2018 paper "Rethinking the Smaller-Norm-Less-Informative Assumption in Channel Pruning of Convolution Layers").

As mentioned above, the classic early-1990s papers "Optimal Brain Damage" and "Second Order Derivatives for Network Pruning: Optimal Brain Surgeon" proposed the OBD and OBS methods, which measure the importance of each weight by the second derivative of the loss function with respect to the weights (the Hessian over the weight vector) and prune accordingly.
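For reference, a sketch of the OBD criterion under its usual assumptions (training has converged, so the gradient term vanishes, and the Hessian is approximated by its diagonal); the saliency notation $s_i$ is mine:

```latex
\delta L \;\approx\; \sum_i g_i\,\delta w_i
  \;+\; \tfrac{1}{2}\sum_i h_{ii}\,\delta w_i^{2}
  \;+\; \tfrac{1}{2}\sum_{i\neq j} h_{ij}\,\delta w_i\,\delta w_j
  \;+\; O\!\left(\lVert\delta w\rVert^{3}\right)

% At a converged minimum g_i \approx 0; with a diagonal Hessian, deleting weight w_i
% (\delta w_i = -w_i) changes the loss by the saliency
s_i \;=\; \tfrac{1}{2}\, h_{ii}\, w_i^{2}
```

Weights with the smallest saliency are pruned first.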

Both methods require computing the Hessian matrix or an approximation of it, which is time-consuming. In recent years, new methods based on this idea have been proposed. For example, the 2016 paper "Pruning Convolutional Neural Networks for Resource Efficient Transfer Learning" is also based on a Taylor expansion, but it uses the absolute value of the first-order term of the objective function's expansion with respect to the activations as the pruning criterion, which avoids computing the second-order term (the Hessian). The 2018 paper "SNIP: Single-shot Network Pruning based on Connection Sensitivity" takes the normalized absolute value of the derivative of the objective function with respect to each parameter as the measure of importance.
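Sketches of the two first-order criteria as I read them (notation mine): the Taylor criterion scores a feature-map activation $a_i$, and SNIP scores a connection $w_j$ through an auxiliary gate $c_j$ (the loss is evaluated on $c \odot w$ at $c = 1$):

```latex
% Taylor-expansion criterion: first-order estimate of |L(a_i) - L(a_i = 0)|
\Theta(a_i) \;=\; \Bigl|\, \frac{\partial L}{\partial a_i}\, a_i \,\Bigr|

% SNIP connection sensitivity, normalized over all connections
s_j \;=\; \frac{\bigl|\partial L/\partial c_j\bigr|}{\sum_k \bigl|\partial L/\partial c_k\bigr|}
```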

Another family of criteria considers the influence on the reconstruction of feature outputs: the idea is to minimize the reconstruction error of the feature outputs after pruning. The intuition is that if pruning the current layer has little effect on the subsequent outputs, then the information that was removed was not very important. Typical examples are the 2017 papers "ThiNet: A Filter Level Pruning Method for Deep Neural Network Compression" and "Channel Pruning for Accelerating Very Deep Neural Networks", both of which decide which channels to prune by minimizing the feature reconstruction error.
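A sketch of the channel-selection objective behind this family of methods, in notation of my own choosing ($Y$ is the original output of the following layer, $X_c$ the input slice contributed by channel $c$, $W_c$ the corresponding kernel slice, and $C'$ the number of channels to keep):

```latex
\min_{S,\,W}\;\Bigl\lVert\, Y \;-\; \sum_{c \in S} X_c * W_c \,\Bigr\rVert_F^{2}
\qquad \text{s.t.}\quad \lvert S\rvert \le C'
```

In practice the hard subset constraint is often relaxed, for example by placing an L1 (lasso) penalty on per-channel selection coefficients so that the selection can be solved efficiently.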

A pruning paper from CVPR 2019, "Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration", re-examines the smaller-norm-less-important principle. For that principle to work, two basic requirements must hold: the norms of the filters must be spread out widely, and the minimum filter norm must be close to zero.

Because most networks cannot meet these requirements, the paper proposes a new angle: if a filter can be represented by the other filters in the same layer, it is redundant and can be deleted. Deleting a redundant filter has the least impact on the layer's output, because its information can be quickly recovered by the remaining filters.

So what kind of filter is redundant, i.e. can be represented by the other filters in the same layer?

The answer is the filter at, or near, the geometric median of the layer's filters. The CVPR 2008 paper "Robust Statistics on Riemannian Manifolds via the Geometric Median" showed that the geometric median can be well represented by the other points around it, which lays the theoretical foundation for this work.
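A minimal sketch of this idea, under my own simplifying assumption that we skip the exact geometric-median computation and simply prune the filters whose total distance to all other filters in the layer is smallest (i.e. those nearest the geometric median):

```python
import torch

def fpgm_style_prune_indices(conv_weight: torch.Tensor, num_prune: int) -> list:
    """Indices of filters closest to the (approximate) geometric median of the layer.

    `conv_weight` is assumed to have shape (out_channels, in_channels, k, k).
    """
    filters = conv_weight.detach().flatten(1)         # (out_channels, in_channels*k*k)
    dist = torch.cdist(filters, filters, p=2)         # pairwise Euclidean distances
    total_dist = dist.sum(dim=1)                      # distance of each filter to all others
    return total_dist.argsort()[:num_prune].tolist()  # smallest total distance ~ nearest the median

# Example: mark 4 of 16 filters of a random 3x3 conv layer for removal
w = torch.randn(16, 32, 3, 3)
print(fpgm_style_prune_indices(w, 4))
```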

The pruning ratio is the ratio of the number of parameters remaining after pruning to the original number of parameters.

Pruning can be divided into static pruning and dynamic pruning.

With the traditional static pruning strategy, previous studies found that re-initializing the weights of the pruned network structure and training it from scratch rarely recovers the pre-pruning accuracy, whereas a small number of finetuning epochs after each pruning step can bring the same pruned structure back to, or only slightly below, the pre-pruning accuracy. The prevailing understanding was therefore that both the pruned network structure and the weights it inherits are important.

However, "The Lottery Ticket Hypothesis", an ICLR 2019 best paper, was the first to raise an objection.

This paper conducted roughly the following experiment: train a network, prune the weights with the smallest magnitudes, then reset the surviving weights to their original initialization values (rather than re-initializing them randomly) and retrain the sparse sub-network. The resulting "winning ticket" sub-network reaches accuracy comparable to the original dense network in a similar number of iterations, whereas the same sub-network with fresh random initialization does noticeably worse.
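A compact sketch of that procedure (one-shot version), with a hypothetical `train_fn` and `on_step_end` hook standing in for a real training loop:

```python
import copy
import torch
import torch.nn as nn

def lottery_ticket(model: nn.Module, train_fn, prune_fraction: float = 0.8):
    """One-shot lottery-ticket experiment: train, prune by magnitude, rewind, retrain."""
    init_state = copy.deepcopy(model.state_dict())       # 0) remember the original initialization

    train_fn(model)                                       # 1) train the dense network

    masks = {}                                            # 2) per-tensor magnitude pruning masks
    for name, p in model.named_parameters():
        if p.dim() > 1:                                   #    prune weight tensors, keep biases
            k = max(1, int(p.numel() * prune_fraction))
            threshold = p.detach().abs().flatten().kthvalue(k).values
            masks[name] = (p.detach().abs() > threshold).float()

    model.load_state_dict(init_state)                     # 3) rewind surviving weights to init values

    def apply_masks(*_):                                  #    keep pruned weights at zero
        with torch.no_grad():
            for name, p in model.named_parameters():
                if name in masks:
                    p.mul_(masks[name])

    apply_masks()
    train_fn(model, on_step_end=apply_masks)              # 4) retrain the sparse "winning ticket"
    return model, masks
```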

However, another paper at the same conference that year, "Rethinking the Value of Network Pruning", gave a similar but somewhat different view. It also denies the importance of keeping the trained weights after pruning, but it additionally denies the necessity of keeping the initialization values. It argues that a model obtained by finetuning after pruning is often worse than a model with the pruned structure trained directly from scratch, although training the pruned structure from scratch usually requires more training epochs.

Why, then, did training from scratch perform worse than reusing the pruned weights in earlier papers' experiments? Because it was taken for granted: the earlier "training from scratch" baselines did not carefully choose hyperparameters and data augmentation strategies, and did not grant the from-scratch models enough compute and training epochs (the authors note that training from scratch needs more epochs to reach accuracy similar to finetuning with the pruned weights).

Another conclusion of "Rethinking the Value of Network Pruning" is that the real value of network pruning lies in network architecture search.

The authors ran the method of "Learning Efficient Convolutional Networks through Network Slimming" five times with different random seeds and found that, for a given pruning ratio, the number of channels retained in each layer was remarkably similar across the five runs, which suggests that this method can indeed find a more efficient and stable structure.

The experiments in "Rethinking the Value of Network Pruning" show that on VGG networks the sparsity-based pruning strategy beats pruning each layer by an equal proportion, while on ResNet and DenseNet it may not be better than uniform per-layer pruning.

The authors analyze the pruned structures of these networks and find that the pruned structures tend toward pruning all layers in roughly equal proportion, which may be why this strategy ends up roughly equivalent to uniform per-layer pruning on ResNet and DenseNet. On VGG, by contrast, the redundancy of the layers is unbalanced, so the learned pruning strategy pays off.

That is roughly where current research stands.