Here’s a function: f(x). It’s expensive to calculate, not necessarily an analytic expression, and you don’t know its derivative.

Your task: find its global minimum.

This is a genuinely difficult task, harder than most other optimization problems in machine learning. Gradient descent, for instance, has access to a function’s derivatives and can exploit mathematical shortcuts for faster evaluation.

In other optimization scenarios, by contrast, the function is cheap to evaluate. If we can get hundreds of results for variants of an input x in a few seconds, a simple grid search yields good results.
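As a minimal sketch of that cheap-evaluation setting, here is a grid search over a toy objective (the function `f` below is an assumption standing in for any cheap-to-evaluate function):

```python
import numpy as np

# Toy objective -- an assumption, standing in for a cheap-to-evaluate f(x).
def f(x):
    return (x - 2.0) ** 2 + np.sin(5 * x)

# Evaluate f at every point on a dense grid and keep the best one.
grid = np.linspace(-5.0, 5.0, 1001)
values = f(grid)
best_x = grid[np.argmin(values)]
print(best_x, values.min())
```

This brute-force approach only works because thousands of evaluations cost nothing here; with an expensive function, the same grid would be hopelessly out of budget.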

There is also an entire host of non-conventional, gradient-free optimization methods, such as particle swarm optimization or simulated annealing.
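To make one of these concrete, here is a bare-bones simulated annealing sketch on a toy multimodal objective (the objective, step size, and cooling schedule are all illustrative assumptions, not tuned recommendations):

```python
import math
import random

random.seed(0)

# Toy multimodal objective -- an assumption for illustration.
def f(x):
    return (x - 2.0) ** 2 + math.sin(5 * x)

x = 0.0                      # current point
fx = f(x)
best_x, best_fx = x, fx
T = 1.0                      # initial temperature
for step in range(5000):
    cand = x + random.gauss(0, 0.5)   # propose a random neighbor
    fc = f(cand)
    # Always accept improvements; accept worse moves with
    # probability exp(-delta / T), which shrinks as T cools.
    if fc < fx or random.random() < math.exp(-(fc - fx) / T):
        x, fx = cand, fc
        if fx < best_fx:
            best_x, best_fx = x, fx
    T *= 0.999               # geometric cooling schedule

print(best_x, best_fx)
```

The occasional acceptance of worse moves is what lets the method escape local minima, but note it still spends thousands of evaluations, which is exactly the budget our problem denies us.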

Unfortunately, the current task has none of these luxuries. Our optimization is constrained on several fronts, notably:

  • The function is expensive to calculate. Ideally we could query it enough times to essentially replicate it, but our optimization method must work with a limited sample of inputs.
  • The derivative is unknown. There’s a reason gradient descent and its variants remain the most popular methods in deep learning and, occasionally, in other machine learning algorithms: knowing the derivative gives the optimizer a sense of direction. We don’t have this.
  • We need to find the global minimum, which is a difficult task even for a sophisticated method like gradient descent. Our method will need some mechanism to avoid getting caught in local minima.


The Beauty of Bayesian Optimization, Explained in Simple Terms