What is Bayesian Machine Learning?

Bayesian Machine Learning (also known as Bayesian ML) is a systematic approach to constructing statistical models, based on Bayes’ Theorem.

Any standard machine learning problem includes two primary datasets that need analysis:

  1. A comprehensive set of training data
  2. A collection of all available inputs and all recorded outputs

The traditional approach to analysing this data for modelling is to look for patterns that map the inputs to the outputs. An analyst will usually piece together a model that captures this mapping, and the result is a purely deterministic method for generating predictions for a target variable.

The only problem is that there is no clear way to explain what is happening inside this model. All that is accomplished, essentially, is the minimisation of some loss function on the training dataset – but that hardly qualifies as true modelling.

An ideal model provides an objective summary of its inherent parameters, supplemented with statistical extras (such as confidence intervals) that can be defined and defended in the language of mathematical probability. This “ideal” scenario is what Bayesian Machine Learning sets out to accomplish.

The Goals (And Magic) Of Bayesian Machine Learning

The primary objective of Bayesian Machine Learning is to estimate the posterior distribution, given the likelihood (which is obtained from the training data) and the prior distribution.

When training a regular machine learning model, this is exactly what we end up doing in theory and practice. Analysts typically perform successive iterations of Maximum Likelihood Estimation (MLE) on the training data, updating the parameters of the model in a way that maximises the probability of seeing the training data given those parameters. In other words, MLE asks how probable the data is under a particular parameter setting, when what we ultimately want to know is how probable the parameters are given the data.

It leads to a chicken-and-egg problem, which Bayesian Machine Learning aims to solve beautifully.

Things take an entirely different turn when an analyst instead seeks to _maximise_ the posterior distribution, treating the training data as fixed and determining how probable any given parameter setting is in light of that data. This process is called Maximum A Posteriori estimation, shortened to MAP. An easier way to grasp this concept is to think about it in terms of the likelihood function.
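
As a compact way to see the contrast, the two objectives can be written side by side (writing θ for the model’s parameters and X for the training data; this notation is introduced here for illustration rather than taken from the text above):

```latex
% MLE maximises the probability of the data given the parameters;
% MAP maximises the probability of the parameters given the data.
\[
\theta_{\mathrm{MLE}} = \arg\max_{\theta} \; p(X \mid \theta),
\qquad
\theta_{\mathrm{MAP}} = \arg\max_{\theta} \; p(\theta \mid X).
\]
```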

Taking Bayes’ Theorem into account, the posterior over the parameters θ, given the training data X, can be defined as:

P(θ | X) = P(X | θ) · P(θ) / P(X)

In this scenario, we can leave the denominator P(X) out, since it does not depend on the parameters: anything that does not depend on the model’s parameters can be ignored in the maximisation procedure. The remaining piece of the puzzle, the prior distribution, is what allows Bayesian models to stand out in contrast to their classical MLE-trained counterparts.
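
As a minimal sketch of this idea, consider a hypothetical coin-bias example in Python. The data, the Beta(2, 2) prior, and the grid of candidate biases are all illustrative choices rather than anything prescribed above; the key point is that the evidence term P(X) is constant in the parameter, so the MAP estimate only needs the likelihood and the prior.

```python
import numpy as np
from scipy.stats import bernoulli, beta

# Hypothetical coin-flip data: 1 = heads, 0 = tails.
rng = np.random.default_rng(0)
data = rng.binomial(1, 0.7, size=20)

# Candidate values for the coin's bias theta.
thetas = np.linspace(0.01, 0.99, 99)

# Log-likelihood of the data under each candidate bias.
log_lik = np.array([bernoulli.logpmf(data, t).sum() for t in thetas])

# Log of a Beta(2, 2) prior: a mild belief that the coin is roughly fair.
log_prior = beta.logpdf(thetas, 2, 2)

# The evidence P(X) does not depend on theta, so it can be dropped:
# maximising log-likelihood + log-prior is enough for the MAP estimate.
theta_mle = thetas[np.argmax(log_lik)]
theta_map = thetas[np.argmax(log_lik + log_prior)]

print(f"MLE estimate of the bias: {theta_mle:.2f}")
print(f"MAP estimate of the bias: {theta_map:.2f}")
```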

Analysts can often make reasonable assumptions about how plausible a specific parameter configuration is, and the prior distribution is what encodes these beliefs about the parameters even before any data has been observed. It’s relatively commonplace, for instance, to use a Gaussian prior over the model’s parameters.

The analyst here is assuming that these parameters have been drawn from a normal distribution, with some specified mean and variance. This sort of distribution has the classic bell-curve shape, concentrating a significant portion of its mass close to the mean.

On the other hand, values out towards the tails are rare. Using such a prior effectively states the belief that _the majority of the model’s weights will fall within a narrow range_ very close to the mean, with only a few exceptional outliers. This is a reasonable belief to hold, taking real-world phenomena and non-ideal circumstances into consideration.
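
To make that belief concrete, a common (though by no means the only) choice is a zero-mean Gaussian prior over the weight vector, whose log-density penalises weights that stray far from zero:

```latex
% Zero-mean, isotropic Gaussian prior over the weights w (an assumed, common form).
\[
p(w) = \prod_{j} \mathcal{N}\!\left(w_j \mid 0, \sigma^{2}\right)
\quad\Longrightarrow\quad
\log p(w) = -\frac{1}{2\sigma^{2}} \sum_{j} w_j^{2} + \mathrm{const.}
\]
```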

The effects of a Bayesian model, however, are even more interesting when you observe that using these prior distributions (together with the MAP procedure) generates results that are strikingly similar, if not identical, to those obtained by performing classical MLE with an added regularisation term.

It’s very amusing to note that just by constraining the “accepted” model weights with the prior, we end up creating a regulariser; a Gaussian prior, in particular, corresponds to the familiar L2 (ridge) penalty.
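
The sketch below illustrates this on a small synthetic linear-regression problem: under Gaussian noise and a zero-mean Gaussian prior, the MAP solution coincides with the minimiser of a squared loss plus an L2 penalty. The data, noise level, and prior scale are invented purely for the example.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Hypothetical linear-regression data: y = Xw + Gaussian noise.
n, d = 100, 3
X = rng.normal(size=(n, d))
true_w = np.array([1.5, -2.0, 0.5])
noise_sigma = 0.5
y = X @ true_w + rng.normal(scale=noise_sigma, size=n)

# Zero-mean Gaussian prior on each weight, with standard deviation prior_sigma.
prior_sigma = 1.0

# MAP under Gaussian noise + Gaussian prior has a closed form: it is the
# ridge-regression solution with penalty lam = noise variance / prior variance.
lam = noise_sigma**2 / prior_sigma**2
w_map = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Classical MLE objective with an explicit L2 penalty, minimised numerically.
def penalised_loss(w):
    return np.sum((y - X @ w) ** 2) + lam * np.sum(w**2)

w_ridge = minimize(penalised_loss, x0=np.zeros(d)).x

print("MAP (Gaussian prior):   ", np.round(w_map, 3))
print("MLE + L2 regularisation:", np.round(w_ridge, 3))
```

Both printed vectors should agree up to the optimiser’s tolerance, which is precisely the sense in which constraining the weights with a Gaussian prior amounts to ridge-style regularisation.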

On the whole, Bayesian Machine Learning is evolving rapidly as a subfield of machine learning, and further development and inroads into the established canon appear to be a natural and likely outcome of the current pace of advancements in computational hardware and statistical methods.

