Summary
Gradient descent is the most commonly used optimization algorithm for training neural network models; essentially all deep learning models are trained with some form of it. The principle behind gradient descent is simple: the gradient of the objective function J(θ) with respect to the parameters θ points in the direction of steepest ascent, so moving the parameters a small step in the opposite direction decreases the objective. The update rule is

θ ← θ − η·∇J(θ)

where η is the learning rate.
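To make the update rule concrete, here is a minimal NumPy sketch of gradient descent on a toy quadratic objective (the objective, starting point, and learning rate below are illustrative choices, not from the article):

import numpy as np

def grad_J(theta):
    # Gradient of the toy objective J(theta) = 0.5 * ||theta||^2, whose minimum is at 0.
    return theta

theta = np.array([3.0, -2.0])   # initial parameters (illustrative)
eta = 0.1                       # learning rate

for step in range(100):
    theta = theta - eta * grad_J(theta)   # theta <- theta - eta * grad J(theta)

print(theta)  # ends up close to the optimum [0, 0]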
For neural network models, the gradient can be computed efficiently with the backpropagation (BP) algorithm, which is what makes gradient descent practical. However, gradient descent has a long-standing problem: it cannot guarantee global convergence. If that problem were solved, the world of deep learning would be a much more harmonious place. In principle, gradient descent does converge to the global optimum for convex problems, because a convex problem has only one local optimum. Deep learning models, however, are complex nonlinear structures and generally pose non-convex problems, which means the loss surface contains many local optima and saddle points, and gradient descent may get stuck in them; this is probably its most troublesome weakness. In this respect it resembles evolutionary methods such as genetic algorithms, which also cannot guarantee convergence to the global optimum, so on this issue we are destined to become "senior hyperparameter tuners". As the update rule shows, an important hyperparameter of gradient descent is the learning rate, and choosing it well matters: if it is too small, convergence is slow; if it is too large, training oscillates and may even diverge. An ideal gradient descent algorithm should satisfy two requirements: it should converge quickly, and it should converge globally. In pursuit of this ideal, many variants of the classical algorithm have been proposed, and they are introduced below.
01
Momentum Optimization
The momentum gradient descent algorithm was proposed by Boris Polyak in 1964. It is based on a simple physical picture: when a ball rolls down from the top of a hill, it starts out slowly, speeds up quickly as it accelerates, and finally settles at a steady speed because of resistance. For momentum gradient descent, the update equations are as follows:

m ← γ·m + η·∇J(θ)
θ ← θ − m

It can be seen that when updating the parameters, not only the current gradient is used but also an accumulated term m (the momentum), at the cost of one extra hyperparameter γ, which is commonly set to about 0.9.
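As a rough NumPy sketch of the momentum update above (same toy quadratic objective as before; the hyperparameter values are illustrative):

import numpy as np

def grad_J(theta):
    return theta  # gradient of the toy objective J(theta) = 0.5 * ||theta||^2

theta = np.array([3.0, -2.0])
m = np.zeros_like(theta)   # accumulated momentum
eta, gamma = 0.1, 0.9      # learning rate and momentum coefficient (illustrative)

for step in range(100):
    m = gamma * m + eta * grad_J(theta)   # m <- gamma*m + eta*grad
    theta = theta - m                     # theta <- theta - m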
02
NAG
The full name of NAG is Nesterov Accelerated Gradient. It is an improved version of momentum gradient descent proposed by Yurii Nesterov in 1983, and it converges faster. The change is that the momentum term is built from a "look-ahead" gradient; the specific formulas are as follows:

m ← γ·m + η·∇J(θ − γ·m)
θ ← θ − m

Since the parameters are going to be moved along γ·m anyway, it makes sense to evaluate the gradient at the estimated future position θ − γ·m instead of at the current position θ; this look-ahead gradient corrects the update earlier, which is what makes NAG faster, as illustrated in Figure 1.
Figure 1: NAG effect
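A minimal NumPy sketch of the NAG update, with the gradient evaluated at the look-ahead point θ − γ·m (toy objective and hyperparameters are illustrative):

import numpy as np

def grad_J(theta):
    return theta  # gradient of the toy objective J(theta) = 0.5 * ||theta||^2

theta = np.array([3.0, -2.0])
m = np.zeros_like(theta)
eta, gamma = 0.1, 0.9

for step in range(100):
    lookahead = theta - gamma * m            # approximate future position
    m = gamma * m + eta * grad_J(lookahead)  # momentum built from the look-ahead gradient
    theta = theta - m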
03
AdaGrad
AdaGrad is a gradient descent algorithm with an adaptive learning rate, proposed by Duchi et al. in 2011. During training, the learning rate gradually decays, and the learning rate of frequently updated parameters decays faster, which is what makes the algorithm adaptive. The update equations are as follows:

s ← s + (∇J(θ))²   (element-wise)
θ ← θ − η·∇J(θ) / √(s + ε)

Here s is the accumulated sum of squared gradients. When the parameters are updated, the learning rate is divided by the square root of this accumulator, with a small constant ε added to avoid division by zero. Since s only grows, the effective learning rate keeps shrinking. Consider the situation shown in Figure 2, where the objective function has very different slopes in two directions. With plain gradient descent, convergence near the bottom of the valley is slow. AdaGrad changes this: in the steep direction the gradients are large, so the learning rate in that direction decays faster, which helps the parameters move toward the bottom of the valley and speeds up convergence.
Figure 2: Effect of AdaGrad
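A minimal NumPy sketch of the AdaGrad update described above (toy objective; hyperparameter values are illustrative):

import numpy as np

def grad_J(theta):
    return theta  # gradient of the toy objective J(theta) = 0.5 * ||theta||^2

theta = np.array([3.0, -2.0])
s = np.zeros_like(theta)   # per-parameter accumulated squared gradients
eta, eps = 0.5, 1e-8

for step in range(100):
    g = grad_J(theta)
    s = s + g * g                               # s <- s + g^2 (element-wise)
    theta = theta - eta * g / np.sqrt(s + eps)  # effective learning rate shrinks as s grows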
As mentioned above, AdaGrad's learning rate is in effect always decaying, which causes a big problem: in the later stage of training the learning rate becomes extremely small, so training stops prematurely. For this reason AdaGrad is rarely used in practice, and the algorithms below improve on this fatal defect. TensorFlow nevertheless provides this optimizer as tf.train.AdagradOptimizer.
04
RMSprop
RMSprop was introduced by Hinton in his course. It is an improvement over AdaGrad, mainly aimed at the problem of the learning rate decaying too fast. The idea is simple and similar in spirit to momentum: a decay hyperparameter ρ is introduced into the accumulated squared-gradient term:

s ← ρ·s + (1 − ρ)·(∇J(θ))²   (element-wise)
θ ← θ − η·∇J(θ) / √(s + ε)

This can be viewed as accumulating only the squared gradients from recent steps, with ρ usually set to 0.9. It is in fact an exponentially decaying average, which keeps s from growing without bound and therefore prevents the learning rate from shrinking too quickly. Hinton also recommends setting the learning rate to 0.001. RMSprop is a good optimization algorithm, and of course it has its place in TensorFlow: tf.train.RMSPropOptimizer(learning_rate=learning_rate, momentum=0.9, decay=0.9, epsilon=1e-10).
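The same kind of sketch with RMSprop's exponentially decayed accumulator (the learning rate follows the 0.001 recommendation mentioned above; the toy objective and other values are illustrative):

import numpy as np

def grad_J(theta):
    return theta  # gradient of the toy objective J(theta) = 0.5 * ||theta||^2

theta = np.array([3.0, -2.0])
s = np.zeros_like(theta)
eta, rho, eps = 0.001, 0.9, 1e-8   # eta = 0.001 as Hinton recommends

for step in range(1000):
    g = grad_J(theta)
    s = rho * s + (1 - rho) * g * g             # exponentially decaying average of g^2
    theta = theta - eta * g / np.sqrt(s + eps)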
As a side note, there is also the AdaDelta algorithm from the same period, which is likewise an improvement over AdaGrad with an idea very similar to RMSprop, except that its derivation is motivated by approximating a second-order (Newton-style) update using only first-order gradients. Those who are interested can read the corresponding paper; it will not be covered further here.
05
Adam
Adam, whose full name is Adaptive Moment Estimation, is an optimization algorithm proposed by Kingma et al. in 2015 that combines the ideas of momentum and RMSprop. Compared with momentum its learning rate is adaptive, and compared with RMSprop it adds a momentum term, so Adam is a combination of the two:

m ← β₁·m + (1 − β₁)·∇J(θ)
s ← β₂·s + (1 − β₂)·(∇J(θ))²   (element-wise)
m̂ ← m / (1 − β₁^t)
ŝ ← s / (1 − β₂^t)
θ ← θ − η·m̂ / (√ŝ + ε)

It can be seen that the first two equations correspond exactly to momentum and RMSprop. Since m and s are initialized to 0, they are biased toward 0 early in training, and the third and fourth equations are bias corrections that scale them up. The last equation is the parameter update. The recommended hyperparameter values are β₁ = 0.9, β₂ = 0.999 and ε = 1e-8. Adam is an algorithm that performs very well in practice, and in TensorFlow it is available as tf.train.AdamOptimizer(learning_rate=learning_rate).
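A minimal NumPy sketch of the Adam update, including the bias corrections (toy objective; hyperparameters are the recommended values above):

import numpy as np

def grad_J(theta):
    return theta  # gradient of the toy objective J(theta) = 0.5 * ||theta||^2

theta = np.array([3.0, -2.0])
m = np.zeros_like(theta)           # first moment (momentum term)
s = np.zeros_like(theta)           # second moment (squared gradients)
eta, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8

for t in range(1, 1001):
    g = grad_J(theta)
    m = beta1 * m + (1 - beta1) * g
    s = beta2 * s + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)   # bias correction: scales m up early in training
    s_hat = s / (1 - beta2 ** t)   # bias correction: scales s up early in training
    theta = theta - eta * m_hat / (np.sqrt(s_hat) + eps)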
Learning rate
The issue of the learning rate was already mentioned above. For gradient descent it is probably the most important hyperparameter. If the learning rate is set very large, training may not converge at all and simply diverge; if it is set very small, training converges but may take an unacceptably long time; if it is set somewhat too high, training is fast at first but oscillates near the optimum and may never settle down. The choice of learning rate can therefore have a large impact, as shown in Figure 3.
Figure 3: Training effects of different learning rates
The ideal learning rate behaves like this: it is large at the beginning, giving fast convergence, and then decays slowly, so that the optimum is reached stably. This is exactly why many of the algorithms above adapt the learning rate automatically. Such a schedule can also be implemented by hand, for example with exponential decay of the learning rate:

η(t) = η₀ · r^(t / k)

where η₀ is the initial learning rate, r is the decay rate, and k is the number of steps over which one decay is applied.
In TensorFlow this can be done with tf.train.exponential_decay, as sketched below.
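A minimal sketch, assuming the TensorFlow 1.x API used elsewhere in this article; the initial rate, decay schedule, and toy loss are illustrative choices, not values from the article:

import tensorflow as tf  # TensorFlow 1.x style API

# Toy variable and loss, only so the graph is complete (illustrative).
x = tf.Variable(5.0)
loss = tf.square(x)

global_step = tf.Variable(0, trainable=False)
# Start at 0.1 and multiply the learning rate by 0.96 every 1000 steps.
learning_rate = tf.train.exponential_decay(
    learning_rate=0.1,
    global_step=global_step,
    decay_steps=1000,
    decay_rate=0.96,
    staircase=True)

optimizer = tf.train.MomentumOptimizer(learning_rate=learning_rate, momentum=0.9)
# Passing global_step lets the optimizer increment it, which drives the decay.
training_op = optimizer.minimize(loss, global_step=global_step)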
Summary
This article has briefly introduced the main types of gradient descent and the commonly used improved algorithms. In general, algorithms with an adaptive learning rate, such as RMSprop and Adam, should be preferred; they work well in most cases. Special attention must still be paid to the learning rate. In fact, many other factors affect gradient descent as well, such as vanishing and exploding gradients, which also need extra care. Finally, it has to be said that gradient descent still cannot guarantee global convergence, and that will remain a hard mathematical problem.
References
An overview of gradient descent optimization algorithms.
NAG.
AdaGrad.
RMSprop: tijmen/csc321/slides/lecture_slides_lec6.pdf.
AdaDelta: https://arxiv.org/pdf/1212.5701v1.pdf.
Adam: https://arxiv.org/pdf/1412.6980.pdf.
Visualization of different algorithms: https://imgur.com/a/Hqolp.