The ins and outs of Gradient Descent

Gradient descent is an optimization algorithm used to minimize a cost function by iteratively moving in the direction of steepest descent. That is, we move in the direction opposite to the gradient, along which the function decreases fastest. In machine learning, we use gradient descent to repeatedly tweak the parameters of our model in order to minimize a cost function. We start with some set of values for our model parameters (weights and biases in a neural network) and improve them slowly. In this blog post, we will start by exploring some basic optimizers commonly used in classical machine learning and then move on to some of the more popular algorithms used in neural networks and deep learning.

Imagine you’re out for a run (or walk, whatever floats your boat) in the mountains and suddenly a thick mist comes in, impairing your vision. A good strategy to get down the mountain is to feel the ground in every direction and take a step in the direction in which the ground is descending the fastest. Repeating this, you should end up at the bottom of the mountain (although you might also end up at something that looks like the bottom but isn’t; more on this later). This is exactly what gradient descent does: it measures the local gradient of the cost function J(ω) (parametrized by model parameters ω) and moves in the direction of descending gradient. Once we reach a gradient of zero, we have reached a minimum, though possibly only a local one.
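To make the mountain analogy concrete, here is a minimal sketch of the update ω ← ω − η·∇J(ω) in plain Python. The cost function J(ω) = (ω − 3)², the function name `gradient_descent`, and the parameters `learning_rate`, `n_steps` and `tol` are all assumptions chosen for illustration; they are not taken from any particular library.

```python
def gradient_descent(grad, w0, learning_rate=0.1, n_steps=100, tol=1e-8):
    """Repeatedly step against the gradient, starting from w0."""
    w = w0
    for _ in range(n_steps):
        g = grad(w)              # "feel the ground": measure the local slope
        if abs(g) < tol:         # slope is (almost) zero, so stop
            break
        w = w - learning_rate * g  # take a step downhill
    return w


# Hypothetical cost J(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
# Its minimum is at w = 3.
w_min = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
print(w_min)
```

Each iteration here shrinks the distance to the minimum by a constant factor, which is why the loop stops well before `n_steps` is exhausted. On a bumpier cost function the same loop could just as well stop in a local minimum.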

Suppose now that the steepness of the hill isn’t immediately obvious to you (maybe you’re too cold to feel your feet, use your imagination!), so you have to use your phone’s accelerometer to measure the gradient. It’s also really cold out, so to check your phone you have to stop running and take your gloves off. Therefore, you need to minimize the use of your phone if you hope to make it down anytime soon! You have to choose how often to measure the steepness of the hill so that you neither go off track nor fail to reach home before sunset. This amount of time between checks is what is known as the **learning rate**, i.e. the size of the steps we take downhill. If our learning rate is too small, the algorithm will take a long time to converge. But if our learning rate is too high, the algorithm can jump past the minimum and even diverge.
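The effect of the learning rate can be seen in a small experiment. The sketch below runs 50 gradient descent steps on the hypothetical cost J(ω) = ω² (gradient 2ω, minimum at ω = 0) with three step sizes; the function name `run` and the specific values 0.001, 0.1 and 1.1 are assumptions picked only to illustrate the three regimes.

```python
def run(learning_rate, w0=1.0, n_steps=50):
    """Run gradient descent on J(w) = w^2 and return the final w."""
    w = w0
    for _ in range(n_steps):
        w = w - learning_rate * 2 * w  # gradient of w^2 is 2w
    return w


too_small = run(0.001)   # barely moves away from the starting point
reasonable = run(0.1)    # ends up very close to the minimum at 0
too_large = run(1.1)     # overshoots further on every step and diverges

print(too_small, reasonable, too_large)
```

For this particular cost, each step multiplies ω by (1 − 2η), so the iterates shrink toward zero only when that factor has magnitude below one. With η = 1.1 the factor is −1.2, which is why every step lands further from the minimum on the opposite side.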


towardsdatascience.com
