An empirical exploration of the Central Limit Theorem

The way I understand it, the Central Limit Theorem states that a distribution of sampled means of ANY distribution will be normal.

Well, sort of.

From Wikipedia:

In probability theory, the central limit theorem (CLT) establishes that, in some situations, when independent random variables are added, their properly normalized sum tends toward a normal distribution (informally a “bell curve”) even if the original variables themselves are not normally distributed. The theorem is a key concept in probability theory because it implies that probabilistic and statistical methods that work for normal distributions can be applicable to many problems involving other types of distributions.

in another place:

The central limit theorem states that the sum of a number of independent and identically distributed random variables with finite variances will tend to a normal distribution as the number of variables grows.

In general, I only start to understand such concepts if I can play around with them.  So let’s see how this works! We will start by defining a function that will perform the sampling to create distributions of means.

In the code below, s in the function getSamp is the name of the distribution that we will be testing.


#preamble 
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')

import pandas as pd
import numpy as np

nbins =100
def getSamp(s='normal(5,3,',n=1000,reps=10000):
    ddist = eval('np.random.'+s+'size=n)')
    xb = eval('np.random.'+s+'size=(n,reps))')
    f, (ax1,ax2) = plt.subplots(ncols=2,figsize=(10,5))
    ax1.set_title('Data distribution')
    ax1.hist(ddist,bins=nbins)
    ax2.set_title('Sample distribution of the mean for: '+s+')')
    mb = xb.mean(axis=0)
    return ax2.hist(mb,bins=nbins) # tmp holds the arrays used to calc

tmp = getSamp()

For the default (normal distribution, mean = 5, std. dev = 3), here’s the distribution and the results distribution of means:

Let’s define a list of distributions that we want to run against.


dist = ['normal(5,3,',
        'uniform(low=0,high=5,',
        'binomial(n=100,p=.5,',
        'logistic(',
        'laplace(',
        'exponential(',
        'lognormal(',
        'poisson(',
        'poisson(lam=50,',
        'power(a=2,',
        'power(a=50,',
        'power(a=10000,',
        'wald(mean=5,scale=1,']

Then run it:


for d in dist:
    tmp = getSamp(s=d)

For uniform and binomial distributions:

For exponential and lognormal distributions:

How about a couple of different poisson distributions?

That default poisson yields a normal-looking mean distribution, but clearly the highly discrete nature of the source distribution has an effect on the mean’s distribution.

But overall, it seems to hold that the distribution of the means are normal. (You might want to try other distributions for yourself…)