Introduction

Greetings! In this blog, I will be talking about gradient descent, one of the fundamental topics that anyone studying machine learning should know. I will try to explain it in a simple way, along with the different types of Gradient Descent and their mathematical equations. Let’s get started!

What is Gradient Descent?

Gradient Descent is a generic optimization algorithm capable of finding optimal solutions to a wide range of problems. The general idea of Gradient Descent is to update weight parameters iteratively to minimize a cost function.

Suppose you are lost in the mountains in a dense fog; you can only feel the slope of the ground below your feet. A good strategy to get to the bottom of the valley quickly is to go downhill in the direction of the steepest slope. This is exactly what Gradient Descent does: it measures the local gradient of the error function with regard to the parameter vector θ, and it goes in the direction of descending gradient. Once the gradient is zero, you have reached a minimum!
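
In equation form, each step moves the parameter vector θ a little way against the gradient of the cost function (taking MSE as the cost, as in the rest of this post), where η is the learning rate discussed below:

$$\theta^{(\text{next step})} = \theta - \eta \, \nabla_{\theta}\,\mathrm{MSE}(\theta)$$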

Generally, we first initialize the θ values randomly and then take baby steps to minimize the cost function (e.g., MSE or RMSE) until the algorithm converges to a minimum.
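
To make this concrete, here is a minimal NumPy sketch of that loop, assuming a simple linear model and made-up data; the learning rate eta and the number of iterations are arbitrary illustrative choices:

```python
import numpy as np

# Made-up data for illustration: y ≈ 4 + 3x plus noise
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]   # add the bias term x0 = 1

eta = 0.1            # learning rate (assumed value)
n_iterations = 1000
m = len(X_b)

theta = np.random.randn(2, 1)       # random initialization of θ

for iteration in range(n_iterations):
    gradients = 2 / m * X_b.T.dot(X_b.dot(theta) - y)   # gradient of MSE w.r.t. θ
    theta = theta - eta * gradients                      # take a baby step downhill
```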

The learning rate is the most important hyperparameter in Gradient Descent: it determines the size of the steps. If the learning rate is too small, the algorithm will have to go through many iterations to converge, which will take a long time.

On the other hand, if the learning rate is too high, you might jump across the valley and end up on the other side, possibly even higher up than you were before. This might make the algorithm diverge, with larger and larger values, failing to find a good solution.

On the left, the learning rate is too low: the algorithm will eventually reach the solution, but it will take a long time. In the middle, the learning rate looks pretty good: in just a few iterations, it has already converged to the solution. On the right, the learning rate is too high: the algorithm diverges, jumping all over the place and actually getting further and further away from the solution at every step.
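
As a rough illustration of these three regimes, you could rerun the sketch above with a few arbitrary learning rates and compare where θ ends up (the values 0.02, 0.1 and 0.5 are just example choices):

```python
def run_gd(eta, n_iterations=50):
    """Batch gradient descent on the toy data above; returns the final θ."""
    theta = np.random.randn(2, 1)
    for _ in range(n_iterations):
        gradients = 2 / m * X_b.T.dot(X_b.dot(theta) - y)
        theta = theta - eta * gradients
    return theta

for eta in (0.02, 0.1, 0.5):   # too small, about right, too large
    print(f"eta={eta}: theta={run_gd(eta).ravel()}")
```

With the small rate you should see θ creep toward the solution very slowly, while the large one overshoots and the values grow instead of settling.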

Finally, not all cost functions look like a nice regular bowl. There may be holes, ridges, plateaus, and all sorts of irregular terrain, which makes it difficult to find the global minimum.

Challenges with Gradient Descent

The above image shows the two main challenges with Gradient Descent: if the random initialization starts the algorithm on the left, then it will converge to a local minimum, which is not as good as the global minimum. If it starts on the right, then it will take a very long time to cross the plateau, and if you stop too early you will never reach the global minimum.

To implement Gradient Descent, we need to compute how much the cost function will change if we change θj just a little bit. This is called a partial derivative. The equation below gives this quantity for a single parameter θj; it sums over all training instances, and we have to compute it for every parameter.
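
For a linear model with the MSE cost function, this partial derivative with respect to a single parameter θj takes the standard form (m is the number of training instances, x⁽ⁱ⁾ and y⁽ⁱ⁾ are the i-th input vector and target):

$$\frac{\partial}{\partial \theta_j}\,\mathrm{MSE}(\theta) = \frac{2}{m}\sum_{i=1}^{m}\left(\theta^{T}x^{(i)} - y^{(i)}\right)x_j^{(i)}$$

Computing this for every θj at once gives the gradient vector ∇θ MSE(θ) used in the update step shown earlier.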

#machine-learning #data-science #gradient-descent #deep-learning
