1597235220

An optimizer is a technique we use to minimize the loss (and thereby increase accuracy). We do that by finding the minimum of the cost function.

Our parameters are updated like this: θ := θ − α · ∂J(θ)/∂θ, where α is the learning rate and J is the cost function.

When our cost function is convex, it has only one minimum, which is its global minimum. In that case we can simply use the Gradient Descent optimization technique, and it will converge to the global minimum after a little hyper-parameter tuning.
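As a minimal sketch of that convex case, here is plain gradient descent on the toy cost J(w) = w², whose only minimum (w = 0) is the global minimum. The learning rate, step count, and cost function are assumed for illustration, not taken from the article:

```python
# Plain gradient descent on the convex cost J(w) = w**2.
# lr and steps are assumed hyper-parameters for this toy example.
def gradient_descent(grad, w, lr=0.1, steps=100):
    for _ in range(steps):
        w = w - lr * grad(w)  # step against the gradient
    return w

# dJ/dw = 2w; starting from w = 5.0 we converge toward the minimum at 0
w_final = gradient_descent(lambda w: 2 * w, w=5.0)
```

With a convex cost like this, any reasonable learning rate eventually reaches the global minimum.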

But in real-world problems the cost function has lots of local minima. Plain Gradient Descent can fail here, and we can end up in a local minimum instead of the global minimum.

So, to keep our model from getting stuck in a local minimum, we use an advanced version of Gradient Descent that adds momentum.

Imagine a ball that starts from some point and rolls downhill along the descent direction. If the ball has sufficient momentum, then it will escape the well (the local minimum) in our cost function's graph.

Gradient Descent with Momentum considers the past gradients to smooth out the update. It computes an exponentially weighted average of your gradients and then uses that average to update the weights.
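A minimal sketch of that update: the velocity `v` is an exponentially weighted average of past gradients, and the weights are updated with `v` instead of the raw gradient. The values of `beta`, the learning rate, and the toy cost J(w) = w² are assumed for illustration:

```python
# Gradient Descent with Momentum (sketch; beta and lr are assumed values).
def momentum_step(w, v, grad, lr=0.1, beta=0.9):
    v = beta * v + (1 - beta) * grad  # exponentially weighted average of gradients
    w = w - lr * v                    # update weights with the averaged gradient
    return w, v

# Demo on the convex cost J(w) = w**2, whose gradient is 2*w
w, v = 5.0, 0.0
for _ in range(500):
    w, v = momentum_step(w, v, grad=2 * w)
```

The averaging damps oscillations, and on a bumpy cost surface the accumulated velocity is what lets the "ball" roll through shallow local minima.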

#neural-networks #deep-learning #optimization-algorithms #algorithms

1599095520

In the 1940s, mathematical programming was synonymous with optimization. An optimization problem included an **objective function** that is to be maximized or minimized by choosing **input values** from an **allowed set of values** [1].

Nowadays, optimization is a very familiar term in AI, specifically in Deep Learning. And one of the most recommended optimization algorithms for Deep Learning problems is **Adam**.

*Disclaimer: a basic understanding of neural network optimization, such as Gradient Descent and Stochastic Gradient Descent, is preferred before reading.*

- Definition of Adam Optimization
- The Road to Adam
- The Adam Algorithm for Stochastic Optimization
- Visual Comparison Between Adam and Other Optimizers
- Implementation
- Advantages and Disadvantages of Adam
- Conclusion and Further Reading
- References

The Adam algorithm was first introduced in the paper **Adam: A Method for Stochastic Optimization** [2] by Diederik P. Kingma and Jimmy Ba. Adam is defined as “a method for efficient **stochastic optimization** that **only requires first-order gradients** with little memory requirement” [2]. Okay, let’s break this definition down into two parts.

First, **stochastic optimization** is the process of optimizing an objective function in the presence of *randomness*. To understand this better, think of Stochastic Gradient Descent (SGD). SGD is a great optimizer when we have a lot of data and parameters, because at each step it calculates an estimate of the gradient from a *random subset of that data* (a mini-batch), unlike Gradient Descent, which considers the entire dataset at each step.
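A minimal sketch of that idea, on a toy one-parameter least-squares model (the function names, batch size, and learning rate are illustrative assumptions, not from the paper):

```python
import random

# One SGD step: estimate the gradient from a random mini-batch
# instead of the full dataset. Model: fit scalar w so that y ≈ w * x.
def sgd_step(w, data, lr=0.001, batch_size=4):
    batch = random.sample(data, batch_size)  # random subset of the data
    # Gradient of the mean squared error over the mini-batch only
    grad = sum(2 * (w * x - y) * x for x, y in batch) / batch_size
    return w - lr * grad

data = [(x, 3.0 * x) for x in range(1, 21)]  # synthetic data, true w = 3
w = 0.0
for _ in range(2000):
    w = sgd_step(w, data)
```

Each mini-batch gradient is a noisy estimate of the full-batch gradient, but it is far cheaper to compute, and on average the steps still move toward the minimum.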

#machine-learning #deep-learning #optimization #adam-optimizer #optimization-algorithms

1599022440

Optimization Algorithms for machine learning are often used as a black box. We will study some popular algorithms and try to understand the circumstances under which they perform the best.

- Understand *Gradient descent* and its variants, such as *Momentum, Adagrad, RMSProp, NAG,* and *Adam*;
- Introduce techniques to improve the performance of Gradient descent; and
- Summarize the advantages and disadvantages of various optimization techniques.

*Hopefully, by the end of this read, you will find yourself equipped with intuitions towards the behavior of different algorithms, and when to use them.*

#optimization #deep-learning #rmsprop #adam #gradient-descent

1597882980

**Contents**

- Gradient descent advanced optimization techniques
- Autoencoders
- Dropout
- Pruning

In this post we will cover a few advanced techniques, namely:

Momentum, Adagrad, RMSProp, Adam, etc.

A machine learning model can contain millions of parameters or dimensions. Therefore the cost function has to be optimized over millions of dimensions.

The goal is to obtain a global minimum of the function which will give us the best possible values to optimize our evaluation metric with the given parameters.

The odds of being at a local minimum in most of the dimensions of a high-dimensional space are low; we are much more likely to encounter saddle points.

**Saddle Points:** a **point** at which a function of two variables has partial derivatives equal to zero but at which the function has neither a maximum nor a minimum value.

In mathematics, a **saddle point** is a point on the surface of the graph of a function where the slopes (derivatives) in orthogonal directions are all zero (a critical point), but which is not a local extremum of the function. An example of a saddle point is a critical point that is a relative minimum along one axial direction (between peaks) and a relative maximum along the crossing axis.

Saddle points can drastically slow down the optimization process. In the diagrams shown below, stochastic gradient descent converges prematurely to a value below the optimum; the other trajectories are different optimization techniques.

In gradient descent we take a step along the gradient in *each* dimension. In the first animation, using SGD we get stuck in the local minimum of one dimension while we are also at the local maximum of another dimension (the gradient is close to zero).

Because our step size in a given dimension is determined by the gradient value, we’re slowed down in the presence of local optima.
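A quick numeric illustration of this slowdown, using the classic saddle f(x, y) = x² − y² (the function and learning rate are assumed for illustration):

```python
# At a saddle point both partial derivatives vanish, so the
# per-dimension step lr * gradient shrinks to zero as we approach it.
def grad_f(x, y):
    return (2 * x, -2 * y)  # partial derivatives of f(x, y) = x**2 - y**2

# Exactly at the saddle, the gradient is zero -> no update at all
gx0, gy0 = grad_f(0.0, 0.0)

# Near the saddle, the step is tiny even though we are not at a minimum
lr = 0.1
x, y = 1e-3, 0.0
gx, gy = grad_f(x, y)
step = (lr * gx, lr * gy)
```

Because the step in each dimension is proportional to that dimension's gradient, an optimizer that relies on the raw gradient alone crawls through such regions.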

#rmsprop #adam #momentum #autoencoder #dropout

1624496700

While writing code and algorithms you should consider tail call optimization (TCO).

Tail call optimization means optimizing recursive functions in order to avoid building up a large **call stack**. You should also know that only some programming languages perform tail call optimization.

For example, Python and Java decided not to implement TCO, while JavaScript has allowed TCO since ES2015 (ES6).

Whether or not you know that your favorite language natively supports TCO, I would definitely recommend assuming that your compiler/interpreter will not do the work for you.

There are two famous methods to perform a tail call optimization and avoid deep call stacks.

As you know, recursion builds up the call stack, so if we avoid such recursion in our algorithms, we save on **memory usage**. This strategy is called **bottom-up**: we start from the beginning, while a recursive algorithm starts from the end, builds up a stack, and works backwards.

Let’s take an example with the following code (top-down — recursive code):

```
function product1ToN(n) {
  return (n > 1) ? (n * product1ToN(n - 1)) : 1;
}
```

As you can see this code has a problem: it builds up a **call stack** of size O(n), which makes our total memory cost O(n). This code makes us vulnerable to a **stack overflow error**, where the call stack gets too big and runs out of space.

In order to optimize our example we need to go bottom-up and remove the recursion:

```
function product1ToN(n) {
  let result = 1;
  for (let num = 1; num <= n; num++) {
    result *= num;
  }
  return result;
}
```

This time we are not stacking up calls in the **call stack**, and we use O(1) space complexity (with O(n) time complexity).

#memoization #programming #algorithms #optimization #optimize