Learning rate in the process of reinforcement learning
At present, deep learning relies on a very simple first-order optimization algorithm, gradient descent. No matter how many adaptive optimization algorithms exist, they are essentially variants of gradient descent, so the initial learning rate plays a decisive role in whether a deep network converges. The gradient descent update rule is:

w := w - \alpha \frac{\partial \, \mathrm{loss}(w)}{\partial w}

Here \alpha is the learning rate. If the learning rate is too small, the network loss decreases slowly. If the learning rate is too large, the parameter updates will be very large, which can cause the network to oscillate around a local optimum or make the loss increase outright.
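To see the effect of \alpha concretely, here is a minimal sketch of this update rule on a toy one-dimensional quadratic loss (the loss function and the specific learning rates are illustrative assumptions, not from the original text):

```python
# Minimal sketch of the update w := w - alpha * d loss(w)/dw on a toy
# one-dimensional quadratic loss; all numbers here are illustrative.
def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    return 2.0 * (w - 3.0)

def train(alpha, steps=20, w0=0.0):
    w = w0
    for _ in range(steps):
        w -= alpha * grad(w)  # the step size is scaled by the learning rate
    return w, loss(w)

# A small learning rate converges slowly, a moderate one converges quickly,
# and a too-large one overshoots the minimum and the loss grows.
for alpha in (0.01, 0.1, 1.1):
    w, final_loss = train(alpha)
    print(f"alpha={alpha}: w={w:.3f}, loss={final_loss:.3e}")
```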

Section 3.3 of Leslie N. Smith's 2015 paper "Cyclical Learning Rates for Training Neural Networks" describes an excellent method for finding a good initial learning rate. I recommend reading this paper, which contains some enlightening ideas about setting the learning rate.

The paper uses this method to estimate the minimum and maximum learning rates the network can tolerate, and we can also use it to find a good initial learning rate. The method is simple. First, set a very small initial learning rate, such as 1e-5. Then, after each batch, update the network and increase the learning rate, recording the loss of each batch. Finally, plot the learning rate and the loss; from these curves we can find the best learning rate.
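A sketch of this procedure is shown below, assuming a PyTorch setup; model, train_loader, and criterion are placeholders for your own network, data loader, and loss function, and the specific bounds (1e-5 to 10) and number of iterations are illustrative choices, not values prescribed by the paper:

```python
import torch

def lr_range_test(model, train_loader, criterion,
                  lr_min=1e-5, lr_max=10.0, num_iters=100):
    """Increase the learning rate after every batch and record each batch loss."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr_min)
    # Multiplicative factor so the lr grows geometrically from lr_min to lr_max.
    gamma = (lr_max / lr_min) ** (1.0 / num_iters)
    lrs, losses = [], []
    lr = lr_min
    model.train()
    data_iter = iter(train_loader)
    for _ in range(num_iters):
        try:
            inputs, targets = next(data_iter)
        except StopIteration:            # restart the loader if it runs out
            data_iter = iter(train_loader)
            inputs, targets = next(data_iter)
        optimizer.zero_grad()
        batch_loss = criterion(model(inputs), targets)
        batch_loss.backward()
        optimizer.step()
        lrs.append(lr)
        losses.append(batch_loss.item())
        lr *= gamma                      # increase the learning rate for the next batch
        for group in optimizer.param_groups:
            group["lr"] = lr
    return lrs, losses
```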

As the learning rate increases from small to large, the network loss will first fall from a relatively large value to a relatively small one and then start to rise again. The two ends correspond to the situations described above: when the learning rate is too small, the loss decreases too slowly; when the learning rate is too large, the loss may increase. From such a figure we can read off a relatively reasonable initial learning rate, here 0.1.
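To read the value off the curve in practice, one can plot the recorded losses against the learning rates on a logarithmic axis. The snippet below continues with the hypothetical lr_range_test sketch above; the "divide the minimum-loss learning rate by 10" rule at the end is just one common heuristic, not something prescribed by the original text:

```python
import matplotlib.pyplot as plt

lrs, losses = lr_range_test(model, train_loader, criterion)

# Loss versus learning rate on a log scale; the flat or decreasing region on
# the left and the rising region on the right correspond to the two cases above.
plt.plot(lrs, losses)
plt.xscale("log")
plt.xlabel("learning rate")
plt.ylabel("loss")
plt.show()

# One common heuristic: take a learning rate somewhat below the point where
# the loss is lowest, e.g. an order of magnitude below it.
best_lr = lrs[losses.index(min(losses))]
print("suggested initial learning rate:", best_lr / 10)
```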

This works because, compared with a large learning rate, the parameter updates made with a small learning rate are negligible. For example, in the first iteration the learning rate is 1e-5 and the parameters are updated; in the second iteration the learning rate becomes 5e-5 and the parameters are updated again. Because these early updates are so small, the parameters at each step can still be regarded as essentially the original parameters, so the loss recorded at each larger learning rate roughly reflects how training from the original parameters would behave at that rate. This is why the learning rate should be swept from small to large; if the order is reversed, from large to small, the loss curve is completely meaningless.
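To make the orders of magnitude concrete, consider an illustrative exponential sweep (the specific bounds and iteration count are assumptions matching the sketch above, not values from the paper): with \eta_{\min} = 10^{-5}, \eta_{\max} = 10, and N = 100 iterations,

\eta_i = \eta_{\min} \left( \frac{\eta_{\max}}{\eta_{\min}} \right)^{i/N} = 10^{-5} \cdot 10^{6i/100},

so \eta_0 = 10^{-5}, \eta_{50} = 10^{-2}, and \eta_{100} = 10. The first few updates are several orders of magnitude smaller than the later ones, which is why the weights at each step can still be treated as approximately the original weights.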