Central limit theorem

The central limit theorem is perhaps the most beautiful theorem in all of statistics. By connecting the binomial distribution to the normal distribution, among other things, it answers the question of why the normal distribution comes up so often in statistics. In fact, by stating that the means of essentially any distribution are asymptotically normally distributed, it appears to give the normal distribution its place as the king of all continuous distributions (or does it?)

The typical motivation for the central limit theorem comes from looking at the distributions of sums of variables -- starting with the sum of just two.

For instance, we've all seen what happens when we add two uniform distributions together:

X2 \ X1   0   1   2   3   4   5
0         0   1   2   3   4   5
1         1   2   3   4   5   6
2         2   3   4   5   6   7
3         3   4   5   6   7   8
4         4   5   6   7   8   9
5         5   6   7   8   9  10

If you graphed the distributions, they'd look like this:

Which is in itself a little troubling, because of the sharp corner at the peak. The distribution is really just a graph of the number of ways to partition $S_2$ into an ordered sum of two integers, each between 0 and 5.
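This count is easy to verify by brute force -- a minimal sketch in Python, tallying all 36 equally likely outcomes of the table above:

```python
from itertools import product
from collections import Counter

# Tally the sums in the 6x6 table: every (X1, X2) pair is equally likely,
# so the frequency of each sum gives the (triangular) distribution.
counts = Counter(a + b for a, b in product(range(6), repeat=2))
print(dict(sorted(counts.items())))
# -> {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 5, 7: 4, 8: 3, 9: 2, 10: 1}
```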

A neat way to visualise this is to imagine the line $x_1 + x_2 = S$ sweeping across a square from one vertex to the opposite one, and to track the length (divided by $\sqrt{2}$) of the segment of intersection.

What about when you add three variables? Perhaps you'd guess two parabolas meeting at an even sharper needlepoint, instead of the two straight lines of the triangle -- and that everything would only get weirder from there.

Well, actually, it doesn't. I encourage you to visualise this for yourself -- while the cross-sectional area starts out increasing quadratically (for about the first third of the journey), the sweeping plane eventually "hits" the faces of the cube and the area grows more slowly.

Figure out the piecewise formula for the area, i.e. for the Irwin-Hall distribution. Hint: you can use dimensional analysis to show that each "piece" is a quadratic, even though a different one each time. But which quadratic?
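If you want to check your piecewise formula against something, the standard closed form for the density of a sum of $n$ Uniform(0, 1) variables -- the Irwin-Hall distribution -- can be compared with a simulation. A sketch (unit uniforms rather than dice; bin width and trial count are arbitrary):

```python
import math, random

def irwin_hall_pdf(x, n):
    """Closed-form density of a sum of n Uniform(0,1) variables."""
    return sum((-1) ** k * math.comb(n, k) * (x - k) ** (n - 1)
               for k in range(int(x) + 1)) / math.factorial(n - 1)

# Monte Carlo check at x = 1.5 for n = 3: estimate the density from the
# fraction of simulated sums landing in a narrow bin around x.
random.seed(0)
n, x, h, trials = 3, 1.5, 0.05, 200_000
hits = sum(1 for _ in range(trials)
           if abs(sum(random.random() for _ in range(n)) - x) < h / 2)
print(irwin_hall_pdf(x, n), hits / (trials * h))   # both ≈ 0.75
```

Note how each "piece" of the formula switches on as $\lfloor x \rfloor$ increases -- exactly the piecewise-quadratic (for $n = 3$) behaviour described above.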

The distribution of a sum of three variables therefore looks like this:

As you keep increasing the dimension of the space, the distribution looks suspiciously more and more like the normal distribution. This puts the normal distribution in a pretty important role as a function: as a function of $X$, it represents (suitably rescaled) the $(n-1)$-volume of the region of intersection between the plane $x_1+x_2+\dots+x_n=X$ and an $n$-cube, as $n$ approaches infinity.

Since the discrete distribution of sums approaches a continuous one as the dimension increases (at least as long as you squish the standard deviation so the lattice of outcomes becomes finer and finer -- this is what happens in the final version of the central limit theorem, where you use averages instead of sums), it's fine to use areas and volumes to approximate the number of discrete points of intersection.
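You can watch this convergence happen numerically: build the exact distribution of a sum of dice by repeated convolution and compare it with a normal curve of matching mean and variance (a sketch; $n = 30$ and the tolerance are illustrative choices):

```python
import math

def convolve(f, g):
    """Distribution of X + Y (as a list indexed by value) from those of X and Y."""
    out = [0.0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            out[i + j] += a * b
    return out

base = [1 / 6] * 6                # uniform on {0,...,5}
dist = base
for _ in range(29):               # distribution of the sum of n = 30 dice
    dist = convolve(dist, base)

n, mu, var = 30, 2.5, 35 / 12     # per-die mean 2.5, variance 35/12
normal = [math.exp(-(s - n * mu) ** 2 / (2 * n * var))
          / math.sqrt(2 * math.pi * n * var) for s in range(len(dist))]
print(max(abs(p - q) for p, q in zip(dist, normal)))   # small discrepancy
```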

Our goal is to find the function that this approaches, to "derive" the normal distribution from this defining property. You may consider several possible ways of doing so -- one that I thought of was to differentiate the volume under the plane in the cube, and take the limit to infinity. Well, you can try it, but you'll probably fail.

Why can't we just use the formula for the number of partitions? Well, that would be like modelling the area as linear, quadratic, cubic, etc. -- it's only true for the first part of the distribution, after which we need to subtract off partitions with compartments larger than 5.

Perhaps you noticed, when sliding the plane across the cube, the line of intersection between the plane and the bottom face of the cube is exactly the line from the previous step -- the line moving across a square. If we do some fancy integral over this domain, it would seem like we'd get the volume under the plane from the area under the line.

Going back to the algebraic world, we retain this insight: recurrence relations seem to be the way to go to solve this problem.

So let's do it that way -- but first, to get an appreciation of what's going on, let's look again at the discrete, blocky distribution of sums we have at low values of n.

Just as the sum of two dice-throws can be represented as a square of side 6, $[0, 5]^2$, the sum of three dice-throws can be represented as a cube of side 6, $[0, 5]^3$. The layers of the cube are shown below:

Layer $X_3=0$:
X2 \ X1   0   1   2   3   4   5
0         0   1   2   3   4   5
1         1   2   3   4   5   6
2         2   3   4   5   6   7
3         3   4   5   6   7  [8]
4         4   5   6   7  [8]   9
5         5   6   7  [8]   9  10

Layer $X_3=1$:
X2 \ X1   0   1   2   3   4   5
0         1   2   3   4   5   6
1         2   3   4   5   6   7
2         3   4   5   6   7  [8]
3         4   5   6   7  [8]   9
4         5   6   7  [8]   9  10
5         6   7  [8]   9  10  11

Layer $X_3=2$:
X2 \ X1   0   1   2   3   4   5
0         2   3   4   5   6   7
1         3   4   5   6   7  [8]
2         4   5   6   7  [8]   9
3         5   6   7  [8]   9  10
4         6   7  [8]   9  10  11
5         7  [8]   9  10  11  12

Layer $X_3=3$:
X2 \ X1   0   1   2   3   4   5
0         3   4   5   6   7  [8]
1         4   5   6   7  [8]   9
2         5   6   7  [8]   9  10
3         6   7  [8]   9  10  11
4         7  [8]   9  10  11  12
5        [8]   9  10  11  12  13

Layer $X_3=4$:
X2 \ X1   0   1   2   3   4   5
0         4   5   6   7  [8]   9
1         5   6   7  [8]   9  10
2         6   7  [8]   9  10  11
3         7  [8]   9  10  11  12
4        [8]   9  10  11  12  13
5         9  10  11  12  13  14

Layer $X_3=5$:
X2 \ X1   0   1   2   3   4   5
0         5   6   7  [8]   9  10
1         6   7  [8]   9  10  11
2         7  [8]   9  10  11  12
3        [8]   9  10  11  12  13
4         9  10  11  12  13  14
5        10  11  12  13  14  15

If you were to track the occurrences of any individual number through this cube, you'd find that they do, in fact, form a (discrete) plane. For example, the "8"s that appear through the cube are marked above. The number of them is the height of the distribution at $S_3=8$.

So what is the number of '8's in the cube? Well, it's the number of '8's in the $X_3=0$ layer, plus the number of '8's in the $X_3=1$ layer, and so on. Now, this is interesting -- it seems like the number of '8's in the $X_3 = 1$ layer equals the number of '7's in the $X_3 = 0$ layer. Similarly, the number of '8's in the $X_3 = 2$ layer equals the number of '6's in the $X_3 = 0$ layer.

Think about why this is so. The reason will be important when we extend this to continuous distributions.

In general, if we let the frequency of a sum $S$ of $n$ variables, each with a uniform discrete distribution on the integers $[0, L]$, be $f_{S,n}$, then:

$$f_{S,n} = \sum\limits_{k = 0}^{L} {f_{S - k,\,n - 1}}$$

Which is the recursion we were looking for.
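A direct implementation of this recursion, starting from a single die on $\{0,\dots,L\}$ as the base case:

```python
# Build the table f[n][S] from the recursion f_{S,n} = sum over k of f_{S-k,n-1},
# with k running from 0 to L, starting from one die on {0,...,L}.
L, N = 5, 3
f = {1: {s: 1 for s in range(L + 1)}}     # frequencies, not probabilities
for n in range(2, N + 1):
    f[n] = {S: sum(f[n - 1].get(S - k, 0) for k in range(L + 1))
            for S in range(n * L + 1)}

print(f[2])      # the triangle: 1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1
print(f[3][8])   # -> 27, the number of '8's in the cube
```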

Perhaps this identity may remind you of some sort of transformation of the hockey-stick identity on Pascal's triangle. In fact, if you let $L=1$, you will get exactly Pascal's triangle -- or a right-angled version thereof -- if you plot a table of $f_{S,n}$ along $S$ and $n$ with suitable initial conditions. Exploration of this is left as an exercise to the reader -- this is very important: it is where the intuition behind the link to binomial distributions comes from.
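To see the Pascal connection concretely, a sketch: with $L = 1$ (a "die" on $\{0, 1\}$) the recursion collapses to $f_{S,n} = f_{S,n-1} + f_{S-1,n-1}$, which is exactly Pascal's rule, so the table of frequencies reproduces the binomial coefficients:

```python
import math

# With L = 1, each row of the frequency table is a row of Pascal's triangle.
N = 8
f = {1: {0: 1, 1: 1}}
for n in range(2, N + 1):
    f[n] = {S: f[n - 1].get(S, 0) + f[n - 1].get(S - 1, 0)
            for S in range(n + 1)}

row = [f[N][S] for S in range(N + 1)]
print(row)                     # -> [1, 8, 28, 56, 70, 56, 28, 8, 1]
assert row == [math.comb(N, S) for S in range(N + 1)]
```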

We can generalise this to a non-uniform base distribution -- just weight each frequency $f_{S-k,\,n-1}$ by the frequency $f_{k,1}$ of drawing the number $k$ you need to bring the sum up from $S-k$ to $S$:

$$f_{S,n} = \sum\limits_{k = 0}^{L} {f_{k,1}\,f_{S - k,\,n - 1}}$$

In fact, in this form, you don't even need to restrict the summation index $k$ to run between $0$ and $L$ -- you can just write:

$${f_{S,n}} = \sum\limits_{k =  - \infty }^\infty  {{f_{k,1}}{f_{S - k,n - 1}}} $$
It's just that for $k$ outside the range $[0, L]$, $f_{k,1}$ is zero for the distribution we've been studying (a uniform distribution on the integers $[0,L]$). For more general distributions this is not necessarily true, and the equation above is the general result.

Why? Link this to the bracketed exploration two steps before this one.
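The general weighted form is a one-liner to experiment with -- a sketch using a made-up, biased base distribution (the pmf below is purely illustrative):

```python
# The weighted recursion f_{S,n} = sum_k f_{k,1} * f_{S-k,n-1}, applied to a
# hypothetical biased base pmf (the numbers below are purely illustrative).
base = {0: 0.5, 1: 0.3, 2: 0.2}

def add_once(dist, base):
    """One convolution step: distribution of S_n from S_{n-1} plus one more draw."""
    out = {}
    for k, pk in base.items():
        for s, ps in dist.items():
            out[k + s] = out.get(k + s, 0.0) + pk * ps
    return out

dist = base
for _ in range(2):        # distribution of the sum of three draws
    dist = add_once(dist, base)
print(dist)               # e.g. P(sum = 0) = 0.5^3 = 0.125
```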

This result we have now, if you think about it, has really been pretty obvious all along. All we're doing is summing up the probabilities of all possible combinations of $S_{n-1}$ and $S_1$. It applies generally, to all discrete distributions -- and we will see the analogue for continuous distributions in a moment.

We've been talking about frequencies all this while, but replacing $f$ with probabilities makes no change to the relation.

Why? Think about this one a bit geometrically -- in terms of the plane-cube thing.

To extend the relation to continuous distributions, however, we need to talk in terms of probability densities. In doing so, we write the probabilities as the product of a probability density and a differential, replacing $k$ with a continuous variable and the summation with an integral.

$${p_n}(S)\,dt = \int_{ - \infty }^\infty  {{p_1}(t){p_{n - 1}}(S - t)\,d{t^2}} $$

Why does it make sense to have a $dt$ differential on the left-hand side? Well, because $t$ is simply the variable on the axis on which a specific value of $S$ is marked -- the probability density is still obtained by dividing the probability by $dt$.

Dividing both sides by $dt$,

$${p_n}(S)\, = \int_{ - \infty }^\infty  {{p_1}(t){p_{n - 1}}(S - t)\,dt} $$
Now this is a very interesting result -- one can see that this is simply a convolution:

$${p_n}(S) = {p_1}(S) * {p_{n - 1}}(S)$$
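To see the convolution identity in action, here's a crude numerical check for Uniform(0, 1), approximating the integral by a Riemann sum (the grid step and tolerances are arbitrary choices):

```python
# Riemann-sum check of p_2 = p_1 * p_1 for Uniform(0,1): the convolution
# integral should reproduce the triangular density of a sum of two uniforms.
h = 0.001
grid = [i * h for i in range(int(1 / h) + 1)]
p1 = [1.0] * len(grid)            # density of Uniform(0,1) on its support

def p2(S):
    """Approximate the integral of p1(t) * p1(S - t) dt."""
    return sum(p1[i] * (1.0 if 0 <= S - t <= 1 else 0.0)
               for i, t in enumerate(grid)) * h

print(p2(0.5), p2(1.0), p2(1.5))  # ≈ 0.5, 1.0, 0.5 (the triangle)
```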
How do we evaluate a convolution? We take a Laplace transform, of course! So we get:

$$\begin{gathered}\mathcal{L}\left[ {{p_n}(S)} \right] = \mathcal{L}\left[ {{p_1}(S) * {p_{n - 1}}(S)} \right] \hfill \\
{\mathcal{P}_n}(\Omega) = {\mathcal{P}_1}(\Omega){\mathcal{P}_{n - 1}}(\Omega) \hfill \\
\end{gathered} $$

Or trivially solving the recurrence relation,

$${\mathcal{P}_n}(\Omega) = {\mathcal{P}_1}{(\Omega)^n}$$
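We can sanity-check this product rule numerically. For Uniform(0, 1) the transform (in moment-generating-function form) is $(e^\Omega - 1)/\Omega$, so the transform of a sum of $n$ such variables should be its $n$-th power -- here estimated by Monte Carlo (the evaluation point $\Omega = 0.7$ is arbitrary):

```python
import math, random

# MGF of one Uniform(0,1) variable at omega, in closed form.
random.seed(1)
n, omega, trials = 3, 0.7, 100_000
mgf1 = math.expm1(omega) / omega
# Empirical E[e^{omega * S}] for S a sum of n independent uniforms.
empirical = sum(math.exp(omega * sum(random.random() for _ in range(n)))
                for _ in range(trials)) / trials
print(mgf1 ** n, empirical)       # should agree to a few decimal places
```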

Our challenge is this: does ${\mathcal{P}_1}{(\Omega)^n}$, for the Laplace transform of any probability density function, always approach the Laplace transform of some fixed function as $n\to\infty$? The limit turns out to be the normal distribution (or rather its Laplace transform).

Well, how do we evaluate ${\mathcal{P}_1}{(\Omega)^n}$? At first glance, it seems impossible that this always converges to the same function -- after all, ${\mathcal{P}_1}{(\Omega)}$ could be any function, right?

Not really. Think about the restrictive properties of a generating function/moment-generating function/"Laplace-transform of a probability distribution" -- a lot of them have to do with its value and the value of its derivatives at zero. This fact strongly suggests an approach involving a Taylor series.

Suppose we Taylor expand ${\mathcal{P}_1}{(\Omega)}$ as follows:

$${\mathcal{P}_1}(\Omega ) = \mathcal{P}_1^{(0)}(0) + \mathcal{P}_1^{(1)}(0)\Omega  + \mathcal{P}_1^{(2)}(0)\frac{{{\Omega ^2}}}{2} + ...$$
This is NOT the generating function of a discrete probability distribution. The coefficients do not represent any probabilities -- they are simply derivatives of the generating function evaluated at zero.

Now, by the properties of generating functions (compare each one to a property of generating functions of discrete variables -- except those are evaluated at $1$ instead of $0$, because their definitions use $z^X$ rather than ${e^{\Omega X}}$),
  • ${\mathcal{P}_1}(0) = 1$ 
  • $\mathcal{P}_1^{(1)}(0) = \mu $
  • $\mathcal{P}_1^{(2)}(0) = E[X^2] = \sigma^2 + \mu^2$ (the second moment -- this is just $\sigma^2$ when the distribution is centred at $\mu = 0$, and the extra $\mu^2$ only contributes higher-order corrections to what follows)
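These derivative facts are easy to check numerically for a concrete case -- Uniform(0, 1), whose MGF is $(e^\Omega - 1)/\Omega$, with $\mu = 1/2$ and $E[X^2] = 1/3$ (finite differences below; the step size is arbitrary):

```python
import math

def M(t):
    """MGF of Uniform(0,1): (e^t - 1)/t, with the t = 0 limit equal to 1."""
    return math.expm1(t) / t if t != 0 else 1.0

h = 1e-3
M0 = M(0)                                   # should be exactly 1
M1 = (M(h) - M(-h)) / (2 * h)               # ≈ μ = 1/2
M2 = (M(h) - 2 * M(0) + M(-h)) / h ** 2     # ≈ E[X²] = σ² + μ² = 1/3
print(M0, M1, M2)
```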

There is something familiar about taking the limit $n\to\infty$ of an expression like

$${\mathcal{P}_n}(\Omega ) = {\left( {1 + \mu \Omega  + \frac{1}{2}{\sigma ^2}{\Omega ^2} + ...} \right)^n}$$
It might remind you of the old limit ${e^x} = \lim\limits_{n \to \infty} {\left( {1 + \frac{x}{n}} \right)^n}$. If only we found a way to get each term on the inside to be "something" (some $x$) divided by $n$.

And well -- it turns out, there is a way. Remember -- when we're talking about the summed distribution, the mean and variance are $n\mu$ and $n\sigma^2$. We will represent these as $\mu_n$ and $\sigma_n^2$, so

$${\mathcal{P}_n}(\Omega ) = {\left( {1 + \frac{{{\mu _n}\Omega }}{n} + \frac{{\sigma _n^2{\Omega ^2}/2}}{n} + ...} \right)^n}$$
When you take the limit as $n\to\infty$, this approaches

$${\mathcal{P}_n}(\Omega ) = {e^{{\mu _n}\Omega  + \frac{1}{2}\sigma _n^2{\Omega ^2}}}$$
Which is precisely the moment-generating function of a normal distribution with mean ${{\mu _n}}=n\mu$ and variance ${\sigma _n^2}=n\sigma^2$. The result for the mean follows by rescaling: dividing the sum by $n$ gives a normal distribution with mean $\mu$ and variance $\sigma^2/n$.
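Here's the theorem checked end-to-end by simulation -- standardised means of Uniform(0, 1) draws compared against the standard normal CDF at one point (the choices of $n$, evaluation point, and trial count are arbitrary):

```python
import math, random

# Standardise the mean of n Uniform(0,1) draws and compare the fraction
# falling below 1 with the standard normal CDF value Φ(1).
random.seed(2)
n, trials = 50, 50_000
mu, sigma = 0.5, math.sqrt(1 / 12)          # mean and sd of Uniform(0,1)
z = [(sum(random.random() for _ in range(n)) / n - mu) / (sigma / math.sqrt(n))
     for _ in range(trials)]
frac = sum(1 for v in z if v <= 1.0) / trials
phi = 0.5 * (1 + math.erf(1 / math.sqrt(2)))   # Φ(1) ≈ 0.8413
print(frac, phi)
```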

There may seem to be something off with our proof. We chose to "cut off" our Taylor series at the $\Omega^2$ term for no apparent reason -- if we had extended the series to include the $\Omega^3$ term, we'd have gotten an additional term representing the third moment of the distribution (the zeroth moment is 1, the first moment is the mean, and the second central moment is the variance).

Indeed, we would've. And the distribution of the mean does indeed approach a skewed near-normal distribution, with a skew that can be calculated from the skew of the original distribution (much like the mean and variance are calculated from those of the original distribution). The skew would decrease rapidly as $n$ increases, of course.

Similarly, if we'd stopped the series at the $\Omega^4$ term, we'd get a term related to the kurtosis (the fourth central moment), which would decrease even faster.

So which distribution does the mean actually approach?

All of them.

Think about the epsilon-delta definition of the limit. You can always define some required degree of closeness, and you'll be able to find an $n$ large enough to ensure that the distribution of the mean is that close to any one of these distributions. The thing with functions is that they can approach several different functions at once -- because all those functions approach the same thing as each other.

We can take the skew into account if we wanted to -- it's just that for big-enough values of $n$, this skew is really small. For even bigger values of $n$, the standard deviation is really small, too. Indeed, we can even say the distribution approaches a Dirac delta function at the mean (called a degenerate distribution).
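A quick simulation of the shrinking skew (the choice of exponential summands and the sample sizes are arbitrary -- for means of $n$ Exponential(1) draws the skewness is exactly $2/\sqrt{n}$):

```python
import math, random

def skew(xs):
    """Sample skewness: third central moment over (sd)^3."""
    m = sum(xs) / len(xs)
    m2 = sum((x - m) ** 2 for x in xs) / len(xs)
    m3 = sum((x - m) ** 3 for x in xs) / len(xs)
    return m3 / m2 ** 1.5

# Skewness of the mean of n Exponential(1) draws shrinks like 2/sqrt(n):
# the "skewed normal-ish" shape flattens out on the way to the normal
# (and, eventually, to a spike at the mean).
random.seed(3)
skews = {}
for n in (4, 16, 64):
    means = [sum(random.expovariate(1) for _ in range(n)) / n
             for _ in range(20_000)]
    skews[n] = skew(means)
print(skews)   # theory: 2/sqrt(n) -> 1.0, 0.5, 0.25
```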

It seems that all functions not just can, but necessarily do approach a number of different functions at once -- i.e. you can always find multiple functions converging to the same thing. Are there any conditions on the function for this to be true? Think about the domain and co-domain.

We've used the phrases "Laplace transform", "generating function", "bilateral Laplace transform" and "moment generating function" interchangeably, but there are subtle differences in the way they're defined (even though they're all kinda isomorphic).
  • A generating function, while typically defined as something for discrete/integer-valued random variables, can be extended to continuous distributions pretty easily. It's equivalent to a moment-generating function if you write $z^X = e^{\Omega X}$ via a variable substitution.
  • When talking about probability distributions, we always take the bilateral Laplace transform, not the standard Laplace transform, because the distribution may be supported on the whole real line (for instance, you can always integrate a normal distribution with non-zero mean from $-\infty$ to $\infty$ with the standard $\sqrt{\pi}$ business, but you can't do that from $0$ to $\infty$ -- if you change variables to centre the exponent, the lower limit moves).
  • Moment generating functions involve using ${e^{\Omega X}}$ instead of ${e^{ - \Omega X}}$ as the integrand. Thus a transformation of $\Omega\to-\Omega$ transforms between a moment generating function and a bilateral Laplace transform.
  • The characteristic function is a Wick rotation of the moment generating function, obtained via an integrand of ${e^{i\Omega X}}$.
  • The Fourier transform is similarly related to the characteristic function, it takes an integrand of ${e^{-i\Omega X}}$.

Here's a convenient table to remember them by:
$$E\left( {{e^{\Omega X}}} \right) = G\left( {{e^\Omega }} \right) = MG(\Omega ) = \mathcal{L}_b( - \Omega ) = \varphi ( - i\Omega ) = \mathcal{F}(i\Omega )$$
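The chain of equalities can be spot-checked numerically for Uniform(0, 1) at a single point (the value $\Omega = 0.5$ is arbitrary): $E[e^{\Omega X}]$ should match the bilateral Laplace transform at $-\Omega$ and the characteristic function $\varphi(t) = E[e^{itX}]$ continued to $t = -i\Omega$.

```python
import cmath, math

omega = 0.5
mgf = math.expm1(omega) / omega   # closed-form E[e^{ΩX}] = (e^Ω - 1)/Ω

def riemann(f, a, b, steps=100_000):
    """Midpoint-rule integral of f over [a, b]."""
    h = (b - a) / steps
    return sum(f(a + (i + 0.5) * h) for i in range(steps)) * h

# L_b(s) = ∫ e^{-sx} p(x) dx, evaluated at s = -Ω, i.e. integrand e^{Ωx}.
laplace = riemann(lambda x: math.exp(omega * x), 0, 1)
# φ(t) = ∫ e^{itx} p(x) dx, continued to t = -iΩ.
char = riemann(lambda x: cmath.exp(1j * (-1j * omega) * x), 0, 1)
print(mgf, laplace, char.real)    # all three should agree
```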
