The Normal distribution (or Gaussian) shows up all over statistics as a result of the Central Limit Theorem. Roughly, the Central Limit Theorem says that (in most common scenarios, the stock market being a notable exception) any time “a bunch of things are added up,” a normal distribution results.

But why? Why that distribution? Why is it special? Why not some other distribution? Are there other statistical distributions where this happens?

Teaser: the answer is yes, there are other distributions that are special in the same way as the Normal distribution. The Normal distribution is still the most special because:

  • It requires the least math
  • It is the most common in real-world situations with the notable exception of the stock market

If you’re intrigued, read on! I’ll give an intuitive sketch of the Central Limit Theorem and a quick proof-sketch before diving into the Normal distribution’s oft-forgotten cousins.

The Central Limit Theorem

Here is a quick official statement:

  • Suppose you have n independent, identically distributed random variables X₁, X₂, …, Xₙ representing a sample of size n from some population with population mean μ and finite variance σ². The population can follow any distribution at all.
  • We are interested in their mean X̄, which is itself a random variable. (It is random because each time we take a sample of size n, we get a different result.)
  • We already know that X̄ has mean μ and variance σ²/n (this follows from the independence assumption and is a general property of random variables).
  • The Central Limit Theorem says that when n is large (usually 40+ is close enough in real life), X̄ approximately follows a normal distribution, no matter what the distribution of the underlying population is.

Formally,

  X̄ₙ → Φ(μ, σ²/n)  in distribution, as n → ∞

Formal statement of the Central Limit Theorem

Here Φ represents the normal distribution with mean and variance as given. (You may be used to seeing the equivalent standard deviation σ/√n instead.) The “in distribution” is a technical bit about how the convergence works. We’ll ignore such technicalities from here on out.
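The theorem is easy to check numerically. Below is a minimal simulation (using NumPy; the exponential population and the sample size of 40 are choices made just for illustration) showing that the means of repeated samples from a very non-normal population still land near μ with spread σ/√n:

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw 10,000 samples of size n = 40 from a decidedly non-normal
# (exponential) population with mean 1 and variance 1.
n, trials = 40, 10_000
sample_means = rng.exponential(scale=1.0, size=(trials, n)).mean(axis=1)

# The CLT predicts the sample means are roughly Normal(mu, sigma^2 / n),
# i.e. centered at 1 with standard deviation 1 / sqrt(40) ~ 0.158.
print(sample_means.mean())  # close to 1
print(sample_means.std())   # close to 0.158
```

A histogram of `sample_means` already looks bell-shaped at n = 40, even though the underlying exponential distribution is heavily skewed.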

Why the Central Limit Theorem Shows Up

The Central Limit Theorem shows up in all sorts of real-world situations. For example, it’s a pretty reasonable assumption that your height can be expressed as the sum of a bunch of factors, among them:

  1. How much milk you drank every day when you were 8 years old
  2. How many X and/or Y chromosomes you have
  3. Which variant of the GH1 gene you have
  4. A whole bunch of other genes
  5. Whether you slept in a Procrustean bed as a child

Take a whole bunch of factors, each of which makes a small difference in your final (adult) height, and, presto, you end up with a (roughly) normal distribution for human heights!

Note that I cheated slightly: the variables here aren’t i.i.d. But independence is a reasonable approximation, and there are stronger versions of the Central Limit Theorem that relax the identical-distribution hypothesis. (We are also leaving aside extreme genetic conditions that affect height.)

So in sum, any time something you measure is made up of a whole bunch of contributions from smaller parts being added up, you are likely to end up with a normal distribution.
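A quick sketch of that idea (the baseline of 170 cm and the 200 uniform “factors” are made-up numbers, chosen only to illustrate the sum-of-small-contributions story):

```python
import numpy as np

rng = np.random.default_rng(1)

# Model "height" as a baseline plus 200 small independent factors,
# each drawn from a (deliberately non-normal) uniform distribution.
factors = rng.uniform(-0.5, 0.5, size=(10_000, 200))
heights = 170 + factors.sum(axis=1)

# Even though each factor is uniform, the sums cluster symmetrically
# around 170 -- a histogram of `heights` looks like a bell curve.
print(heights.mean())  # close to 170
print(heights.std())   # close to sqrt(200 / 12) ~ 4.08
```

No individual factor is normally distributed, yet adding 200 of them produces a roughly normal result, exactly as the theorem predicts.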

A Quick Proof

This proof is necessarily a sketch because, well, if you want a full proof with all of the analysis and probability theory involved, go read a textbook. The main point I want to get across is that there is a reason Euler’s number e shows up.

First of all, we will need one high-powered mathematical tool. To every reasonable random variable X there is a characteristic function φ, defined by φ(t) = E[e^(itX)], which is, in essence, the Fourier transform of the probability density function (PDF) of X.
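To make the characteristic function concrete: for a standard normal X, its closed form is known to be φ(t) = e^(−t²/2), and we can verify this against a Monte Carlo estimate of E[e^(itX)] (the value t = 1.5 below is an arbitrary test point):

```python
import numpy as np

rng = np.random.default_rng(2)

# Characteristic function: phi(t) = E[exp(i * t * X)].
# For a standard normal X, the closed form is exp(-t^2 / 2).
x = rng.standard_normal(1_000_000)
t = 1.5
phi_estimate = np.exp(1j * t * x).mean()  # Monte Carlo estimate (complex)
phi_exact = np.exp(-t**2 / 2)

print(phi_estimate.real, phi_exact)  # both close to exp(-1.125) ~ 0.3247
```

The imaginary part of the estimate is near zero, as expected for a distribution that is symmetric about 0.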


Why is the Normal Distribution so Normal?