Introduction

By the end of this article you’ll know:

  • What Gradient Clipping is and which problem it solves
  • Types of clipping techniques
  • How to implement it in TensorFlow and PyTorch
  • Additional research for you to read

Gradient Clipping solves one of the biggest problems that we have while calculating gradients in Backpropagation for a Neural Network.

You see, in a backward pass we calculate gradients of all weights and biases in order to converge our cost function. These gradients, and the way they are calculated, are the secret behind the success of Artificial Neural Networks in every domain.

But every good thing comes with some sort of caveats.

Gradients encapsulate the information they collect from the data, which includes long-range dependencies in long text sequences or other high-dimensional data. So, when training on complex data, things can go south really quickly, and you’ll blow up your next million-dollar model in the process.

Luckily, you can solve it before it occurs (with gradient clipping) – let’s first look at the problem in-depth.

Common problems with backpropagation

The Backpropagation algorithm is the heart of all modern-day Machine Learning applications, and it’s ingrained more deeply than you think.

Backpropagation calculates the gradients of the cost function w.r.t. the weights and biases in the network.

It tells you about all the changes you need to make to your weights to minimize the cost function (stepping along −∇C gives you the steepest decrease, while +∇C would give you the steepest increase in the cost function).

Pretty cool, because now you get to adjust all the weights and biases according to the training data you have. Neat math, all good.
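To make that concrete, here’s a minimal sketch of a single gradient-descent step in PyTorch. The toy data, model, and learning rate are placeholders just to show the update rule w ← w − lr·∇C, not anything from a real project:

```python
import torch

torch.manual_seed(0)
x = torch.randn(32, 10)          # toy inputs (placeholder data)
y = torch.randn(32, 1)           # toy targets

w = torch.randn(10, 1, requires_grad=True)   # weights
b = torch.zeros(1, requires_grad=True)       # bias
lr = 0.01                                    # learning rate

cost = torch.mean((x @ w + b - y) ** 2)      # cost function C
cost.backward()                              # backward pass: computes dC/dw and dC/db

with torch.no_grad():
    w -= lr * w.grad                         # step along -∇C (steepest decrease)
    b -= lr * b.grad
    w.grad.zero_()                           # reset gradients for the next step
    b.grad.zero_()
```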

Vanishing gradients

What about deeper networks, like Deep Recurrent Networks?

As the effect of a change in the cost function C is translated back to the weights in the initial layers, the norm of the gradient becomes so small – due to the increased model depth and number of hidden units – that it effectively shrinks to zero after a certain point.

This is what we call Vanishing Gradients.

This hampers the learning of the model. The weights in those layers no longer contribute to reducing the cost function C and go unchanged, which in turn affects the network in the forward pass and eventually stalls the model.
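If you want to see this for yourself, here’s a rough illustration (an arbitrary stack of sigmoid layers, not any particular architecture from this article) that prints the gradient norm seen by each layer of a deep network – the norms shrink the closer you get to the input:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
depth = 30
layers = []
for _ in range(depth):
    layers += [nn.Linear(64, 64), nn.Sigmoid()]   # deep stack of sigmoid layers
model = nn.Sequential(*layers)

x = torch.randn(16, 64)
loss = model(x).sum()
loss.backward()

# Gradient norms shrink towards the early (input-side) layers.
for i, layer in enumerate(model):
    if isinstance(layer, nn.Linear):
        print(f"layer {i:2d}  ||dC/dW|| = {layer.weight.grad.norm():.2e}")
```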

Exploding gradients

On the other hand, the Exploding gradients problem refers to a large increase in the norm of the gradient during training.

Such events are caused by an explosion of the long-term components of the gradient, which can grow exponentially faster than the short-term ones.

This results in an unstable network that, at best, cannot learn from the training data, because the huge updates make it impossible for gradient descent to take a meaningful step.
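As a quick preview of the fix we’ll implement later, here’s a minimal PyTorch sketch that clips the global gradient norm before the optimizer step. The model, data, and the threshold of 1.0 are placeholder choices for illustration, not recommendations from this article:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)                       # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
data, target = torch.randn(8, 10), torch.randn(8, 1)

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(data), target)
loss.backward()

# Rescale the global gradient norm to at most 1.0, so one exploding
# batch cannot blow up the weight update.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```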
