Optimizers are algorithms or methods used to adjust the attributes of a neural network, such as its weights and learning rate, in order to reduce the loss.

Different types of optimizers:

  1. Batch Gradient Descent (BGD)
  2. Stochastic gradient descent (SGD)
  3. Mini-batch gradient descent (MBGD)
  4. Adagrad
  5. Adadelta
  6. RMSProp
  7. Adam

Here I’m going to discuss all of them in detail:

  1. Batch Gradient Descent (BGD):

Gradient update rule:

BGD uses the entire training set to calculate the gradient of the cost function with respect to the parameters, and performs one update per pass over the data.
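As a concrete illustration, here is a minimal NumPy sketch of BGD for linear regression with a mean-squared-error loss. The data, variable names, and hyper-parameter values are my own and chosen only for illustration, not taken from the article.

import numpy as np

# Toy data (made up for illustration): 100 samples, 3 features.
X = np.random.randn(100, 3)
y = X @ np.array([2.0, -1.0, 0.5])

w = np.zeros(3)            # parameters to learn
learning_rate = 0.1

for epoch in range(100):
    # Gradient of the mean-squared-error loss over the ENTIRE training set.
    grad = 2.0 / len(X) * X.T @ (X @ w - y)
    w -= learning_rate * grad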

Disadvantages:

Because this method computes the gradient over the whole dataset for a single update, each update is very slow, it becomes unwieldy for very large datasets, and new data cannot be fed in to update the model in real time.

Batch gradient descent can converge to a global minimum for convex functions and to a local minimum for non-convex functions.

2. Stochastic gradient descent (SGD):

Gradient update rule:

Compared with BGD, which computes the gradient using all of the data at once, SGD performs an update using the gradient of a single sample at a time.

x += -learning_rate * dx

where x is a parameter, dx is its gradient, and learning_rate is a constant.

For large datasets there may be many similar samples, so BGD computes redundant gradients before each update. SGD has no such redundancy because it updates on one sample at a time, so it is faster, and new samples can be added on the fly. A sketch of the per-sample loop follows below.
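The following illustrative NumPy sketch (my own toy data and variable names, not from the article) applies the update rule above once per sample, visiting the samples in a random order each epoch:

import numpy as np

X = np.random.randn(100, 3)                 # toy features (made up)
y = X @ np.array([2.0, -1.0, 0.5])          # toy targets

w = np.zeros(3)
learning_rate = 0.01

for epoch in range(10):
    for i in np.random.permutation(len(X)): # visit samples in random order
        xi, yi = X[i], y[i]
        dw = 2.0 * xi * (xi @ w - yi)       # gradient from a SINGLE sample
        w += -learning_rate * dw            # same rule as x += -learning_rate * dx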

Disadvantages:

However, because SGD updates much more frequently, the cost function oscillates severely during training.

BGD converges steadily to a local minimum; the oscillation of SGD, on the other hand, may allow it to jump out of one local minimum and into a better one.

When we slowly decrease the learning rate, SGD shows the same convergence behaviour as BGD.

3. Mini-batch gradient descent (MBGD):

Gradient update rule:

MBGD computes each update on a small batch of n samples. This reduces the variance of the parameter updates, so convergence is more stable.

It can also make full use of the highly optimized matrix operations in deep learning libraries for more efficient gradient calculations.

The difference from SGD is that each update acts not on a single sample but on a batch of n samples.

Setting the hyper-parameter: n is generally in the range 50–256. A minimal sketch of the mini-batch loop follows below.
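Here is an illustrative NumPy sketch of the mini-batch loop; the data, batch size, and learning rate are arbitrary choices of mine for the example, not values prescribed by the article.

import numpy as np

X = np.random.randn(1000, 3)                # toy data (made up)
y = X @ np.array([2.0, -1.0, 0.5])

w = np.zeros(3)
learning_rate = 0.05
n = 64                                      # mini-batch size, within the 50-256 range

for epoch in range(20):
    order = np.random.permutation(len(X))
    for start in range(0, len(X), n):
        batch = order[start:start + n]
        Xb, yb = X[batch], y[batch]
        # Gradient averaged over the n samples of the batch.
        grad = 2.0 / len(batch) * Xb.T @ (Xb @ w - yb)
        w -= learning_rate * grad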

Disadvantages:

Mini-batch gradient descent does not guarantee good convergence: if the learning rate is too small, convergence is slow; if it is too large, the loss function oscillates around the minimum or even diverges.

One remedy is to start with a larger learning rate and reduce it whenever the change between two iterations drops below a certain threshold.
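One simple way to implement this idea is sketched below; the threshold, decay factor, and toy data are my own illustrative choices, not values from the article.

import numpy as np

X = np.random.randn(200, 3)                 # toy data (made up)
y = X @ np.array([2.0, -1.0, 0.5])
w = np.zeros(3)

learning_rate = 0.5                         # start with a relatively large step
decay_factor = 0.5                          # arbitrary choice for this sketch
threshold = 1e-4                            # arbitrary "change between iterations" cut-off
prev_loss = float("inf")

for epoch in range(100):
    grad = 2.0 / len(X) * X.T @ (X @ w - y)
    w -= learning_rate * grad
    loss = np.mean((X @ w - y) ** 2)
    if prev_loss - loss < threshold:        # progress has stalled
        learning_rate *= decay_factor       # reduce the learning rate
    prev_loss = loss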

4. Adagrad:

Adagrad is a gradient-based optimization algorithm that adapts the learning rate to the parameters, using lower learning rates for parameters associated with frequently occurring features and higher learning rates for parameters associated with infrequent features.
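Adagrad achieves this by accumulating the squared gradients of each parameter and dividing the learning rate by the square root of that running sum. A minimal NumPy sketch is shown below; the data and variable names are my own illustration.

import numpy as np

X = np.random.randn(200, 3)                 # toy data (made up)
y = X @ np.array([2.0, -1.0, 0.5])

w = np.zeros(3)
learning_rate = 0.1
eps = 1e-8                                  # avoids division by zero
grad_squared_sum = np.zeros_like(w)         # per-parameter sum of squared gradients

for epoch in range(100):
    grad = 2.0 / len(X) * X.T @ (X @ w - y)
    grad_squared_sum += grad ** 2
    # Parameters with a history of large gradients get a smaller effective step.
    w -= learning_rate * grad / (np.sqrt(grad_squared_sum) + eps)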

#artificial-intelligence #machine-learning #optimizer #deep-learning #artificial-neural-network
