How my coding background helped me understand complex analysis

I use many statistical tools, but I am not a statistician.  Although I took several stats courses in college and graduate school, I’m not familiar enough with statistical theory to understand what a complex analysis will do with a given data set beyond simple situations.

It’s a problem.  In my field, ecologists are primarily interested in biological systems and the complex interactions of species and their environment.  Experiments that address interesting problems can get complicated very quickly. In graduate school, I saw many ecology students apply very complicated analyses to their complex experimental designs, often because they knew it would be expected by their committee and journal reviewers.   You might say that “a sufficient understanding of the analysis was often lacking.”

I’m speaking from personal experience, of course. (There are many mathematically agile ecologists, including a solid bunch in my grad program.)  In my main experiments, I had to manipulate the diversity of species in a natural community and then follow how the community responded to different kinds of stresses.  At the time it was a novel experiment for fieldwork and I was absorbed by how to accomplish it.  I didn’t carefully think about the statistical design until the data started rolling in. (Big Mistake, but I’m sure I’m not alone.)

I got some advice on how to design the analysis, but I struggled to understand how to interpret its results.  "Struggled" isn't quite the word; I was sure I was a failure.  If I couldn't understand my analysis, how could I possibly defend my work?

But then I took the “Community Analysis” course by Bruce McCune.  Bruce is one of those math-savvy ecologists who also has a knack for explaining how things work to the math-not-so-savvy.  The content of the course was multivariate analysis with an ecological focus.  But the big take-home lesson for me was:  explore your analysis tools however you can until you understand what they will do for you.   For me, that meant simulating all kinds of data, running that through the technique we were learning, and watching what it spit out.  That was something I could do!  (Thank you, Dr. McCune!)

This led me to a series of tests of not only my own analysis, but also the kinds of experimental designs ecologists were beginning to use to address questions about the function of diversity.  It was fascinating to me because it showed how easy it was to fool ourselves into thinking we were seeing one thing in a particular experiment, when it could easily be something else, hidden by our choice of experimental design and analysis.

A few years after graduating from Oregon State, I inherited a large survey for which the analysis was intrinsically built into the sample design – the way it should be.  It was a complex analysis, the data were messy, and it was a massive effort for the field crew.  I was excited to be part of the project because of its scale and its promise to reveal broad patterns.  But the analysis was difficult for me to wrap my head around – a design to estimate the dominant spatial scale of variability using a nested ANOVA.   So I did what I knew how to do: I created simulated data sets with different scales of variability and ran them through the analysis scheme.   And sure enough, variability at one level was unexpectedly showing up at other levels in the results, often to a surprising degree.  When I posted my results to a forum of stats experts (R-help), I received validation: sure, without lots of replication, you would expect that to happen.  These complex analyses won't work miracles with low-power data. (Thank you, Dr. Lumley!)  The data we were collecting were valuable for lots of reasons, but probably not for identifying the dominant scale of variability.
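The actual survey used a deeper nested design than this, but a two-level sketch captures the approach: simulate data with known variability at each level, push it through the nested-ANOVA arithmetic, and see how well the estimates recover the truth.  This is my own minimal reconstruction in Python, with invented variances and made-up function names, not the original analysis.

import numpy as np

rng = np.random.default_rng(1)

def simulate_and_estimate(n_sites=5, n_plots=10, sd_site=2.0, sd_plot=1.0):
    # simulate a balanced two-level design: plots nested within sites
    site_effects = rng.normal(0, sd_site, size=n_sites)
    y = site_effects[:, None] + rng.normal(0, sd_plot, size=(n_sites, n_plots))

    # classic one-way random-effects ANOVA, method-of-moments estimates
    site_means = y.mean(axis=1)
    ms_between = n_plots * ((site_means - y.mean()) ** 2).sum() / (n_sites - 1)
    ms_within = ((y - site_means[:, None]) ** 2).sum() / (n_sites * (n_plots - 1))
    var_plot = ms_within
    var_site = (ms_between - ms_within) / n_plots
    return var_site, var_plot

# true values are sd_site**2 = 4 and sd_plot**2 = 1; with only five sites,
# the site-level estimate bounces around a lot, and plot-level noise can
# masquerade as site-level signal
for _ in range(5):
    print(simulate_and_estimate())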

I have found that simulating expected patterns and testing an analysis scheme against those simulated data has helped me gain an almost intuitive understanding of what an analysis is really telling me.  As someone who depends on complicated analyses as tools, such exploration has become invaluable.

 

How many cuts of a deck of cards does it take before the deck is shuffled? Maybe the answer will surprise you.

During the current covid-19 “stay-at-home” order, my family has been playing a lot of cards.  Inevitably, someone will complain that the dealer hasn’t shuffled enough or is over-shuffling.  And that leads to questions like “how can you even know if you have shuffled enough?”

Sounds like a job for simulations!  Let’s start with some very simple models:

If we assume that, instead of the typical deck of Ace, 2, 3, …, Q, K in four different suits, the cards here are just numbered from 0 to 51, we simplify the simulation problem.  Thus, the deck, when “new,” will be consecutively ordered from 0 to 51.

Furthermore, a simple measure of its likeness to the "new" state is the accumulated difference between consecutive cards:

Our “new” deck is defined as a simple list:

def ndeck():
    return list(range(52))

And looks like:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 
10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 
20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 
30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 
40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 
50, 51]

A shuffled deck is a new deck scrambled with the random.shuffle function from Python's random module:

import random

def rdeck():
    d = ndeck()
    random.shuffle(d)   # shuffle in place
    return d

… and will look something like:

[19, 45, 44, 40, 32, 33, 27, 37, 14, 10, 
34, 5, 23, 8, 17, 21, 48, 6, 47, 24, 
38, 15, 46, 29, 22, 12, 36, 42, 0, 41, 
3, 2, 13, 11, 51, 30, 9, 28, 20, 49, 
31, 16, 4, 50, 43, 39, 25, 1, 18, 7, 
26, 35]

We will calculate scores of our decks with:

def deck_score(deck):
    accum = 0
    for i, card in enumerate(deck):
        if i < len(deck) - 1:               # don't go past the final card
            accum += abs(card - deck[i + 1])
    return accum

What scores do a new and a shuffled deck yield?

 

The new deck score is 51: there are 51 adjacent pairs, each differing by exactly 1.

Shuffled decks (scrambled with the standard Python random module) give us a range of scores, and the mean shuffled score is about 900.
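A quick way to check both numbers yourself, using the functions defined above (the shuffled mean is a simulation estimate, so your exact value will wander a little):

print(deck_score(ndeck()))                      # 51

scores = [deck_score(rdeck()) for _ in range(10000)]
print(sum(scores) / len(scores))                # roughly 900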


Cutting the deck

Let’s start with a simple form of shuffling:  just cutting the deck.  This is the simple process of separating the deck into two piles of random thickness and then switching top and bottom halves.  A function to do that for us:

def cut_deck(deck,minThin=4):  # minThin controls how thin the piles can be
    lower = minThin # the lower and upper limits to where the cut can be in the deck
    upper = len(deck)-minThin
    cut = random.randint(lower,upper)  # location of the cut
    new = deck[cut:] + deck[:cut]  # bottom on top, top on the bottom
    return new

And a cut deck looks something like this:

[10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 
20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 
30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 
40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 
50, 51, 0, 1, 2, 3, 4, 5, 6, 7, 
8, 9]

Pretty simple, eh?


A Null Hypothesis

A good place to start exploring is to guess what you are going to see before you start simulating.  In this case, my first hypothesis is that, as we cut the same deck repeatedly, its deck score will climb toward the mean score of a shuffled deck (about 900), and that it will take about 12 cuts to make a random-looking deck.  Why 12?  Just a guess.

I think it will look something like:

[Sketch of the hypothesis: deck score starting at 51 and climbing toward the shuffled-deck mean of about 900 over roughly a dozen cuts]

Consecutive cuts

If a new deck gives us a score of 51, and a deck cut once yields 101, what does cutting a deck twice yield?

151?

l = []
d = ndeck()
l.append(deck_score(d))
for i in range(10):
    d = cut_deck(d)
    #print(d)
    l.append(deck_score(d))
print(l)

And the result of the consecutive cuts?

[51, 101, 101, 101, 101, 101, 
101, 101, 101, 101, 101]

Huh?!  Let’s look at a deck after a million cuts:

d = ndeck()
for i in range(1000000):
    d = cut_deck(d)

The result:

[39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 
49, 50, 51, 0, 1, 2, 3, 4, 5, 6, 
7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 
27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 
37, 38]

Just a single cut!


Thinking a little more carefully reveals what's happening: the second cut reconnects the two cards that were separated by the first cut!

This was a surprise to me. Consecutive cutting doesn't actually change much, because each cut is just a cyclic rotation of the deck.  The next cut "heals" the previous one, so the deck never has more than one "seam": 50 adjacent pairs differing by 1 plus a single jump of 51, which is why the score sticks at 101.

Thinking about it now, it makes sense. But it contradicts my long-held belief that cutting multiple times results in a shuffled deck.  I admit that I had to pull out a real deck of cards to convince myself that this simulation result wasn't an artifact.  It is not.  Try it out.
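If you'd rather convince yourself in code, here is a quick check using the functions above (the rotation test itself is my addition):

d = ndeck()
for _ in range(1000):
    d = cut_deck(d)

# any number of cuts leaves a single rotation of the new deck:
# the values simply wrap around from 51 back to 0 at one seam
rotation = list(range(d[0], 52)) + list(range(d[0]))
print(d == rotation)    # True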

Was this a surprise to you?  Just about everyone in my house guessed that the deck would be pretty shuffled after a bunch of cuts. Probably a thinking error that magicians exploit.

Digging into Random Forests

More than a year ago, I worked through a few Fast.ai courses.

It was a fantastic experience: an introduction to the nuts and bolts of applying machine intensive analysis to real problems in a way that one could see results almost immediately; a code-based approach (it helped that I already knew lots of Python) that got into the nitty-gritty of solving tough problems; an emphasis on working habits that led to an agile approach; and an amazingly active and collaborative community of students.  Somehow, whatever Jeremy Howard says always makes sense and is steeped in hard-won expertise.  It was a perfect fit for me and I learned a ton.

But honestly, I have not applied the Fast.ai material to my own projects much at all.  I think the simple reason is that there is an inertia I have to overcome when I start a Fast.ai type analysis: securing enough GPU computing power, making sure all the modules are up-to-date and aligned to the project at hand, relearning the processes that Jeremy outlined and adapting them to the project.  While I am sure that many Fast.ai students are plugged in enough to make those obstacles relatively trivial, my available time and resources just aren’t a good match to make all that work.

But I’ve realized that there should be a lot I can do towards implementing a machine learning approach without needing the heavy infrastructure, especially since the problems I’m likely to tackle aren’t necessarily the massive ones targeted in the courses.  Jeremy’s approach covered tools students need to tackle some of the biggest problems – but that brought lots of complexity.  I just need to pare it down to the basics –  Fast.ai on the cheap – and create a learning environment for myself in which I can make some progress.

The projects I have in mind right now would be best served by an exploration of Random Forest methodologies.  That’s where I will start…
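To make "Fast.ai on the cheap" concrete, here is the scale of starting point I have in mind: a plain scikit-learn random forest on synthetic data, no GPU required.  The data set and parameters below are purely illustrative, not one of my actual projects.

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# made-up regression data standing in for a real project
X, y = make_regression(n_samples=2000, n_features=20, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(n_estimators=200, n_jobs=-1, random_state=0)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))           # R^2 on held-out data
print(rf.feature_importances_[:5])        # which features the forest leans on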

 

An empirical exploration of the Central Limit Theorem

The way I understand it, the Central Limit Theorem states that a distribution of sampled means of ANY distribution will be normal.

Well, sort of.

From Wikipedia:

In probability theory, the central limit theorem (CLT) establishes that, in some situations, when independent random variables are added, their properly normalized sum tends toward a normal distribution (informally a “bell curve”) even if the original variables themselves are not normally distributed. The theorem is a key concept in probability theory because it implies that probabilistic and statistical methods that work for normal distributions can be applicable to many problems involving other types of distributions.

And in another place:

The central limit theorem states that the sum of a number of independent and identically distributed random variables with finite variances will tend to a normal distribution as the number of variables grows.

In general, I only start to understand such concepts if I can play around with them.  So let’s see how this works! We will start by defining a function that will perform the sampling to create distributions of means.

In the code below, the argument s of the function getSamp is a string holding the name (and parameters) of the numpy.random distribution we will be testing; the sampling calls are assembled from it with eval.


# preamble
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')

import pandas as pd
import numpy as np

nbins = 100

def getSamp(s='normal(5,3,', n=1000, reps=10000):
    ddist = eval('np.random.' + s + 'size=n)')        # one sample straight from the distribution
    xb = eval('np.random.' + s + 'size=(n,reps))')    # reps independent samples of size n
    f, (ax1, ax2) = plt.subplots(ncols=2, figsize=(10, 5))
    ax1.set_title('Data distribution')
    ax1.hist(ddist, bins=nbins)
    ax2.set_title('Sample distribution of the mean for: ' + s + ')')
    mb = xb.mean(axis=0)                              # the mean of each size-n sample
    return ax2.hist(mb, bins=nbins)                   # the returned arrays land in tmp below

tmp = getSamp()

For the default (a normal distribution with mean 5 and standard deviation 3), here are the data distribution and the resulting distribution of sample means:

Let’s define a list of distributions that we want to run against.


dist = ['normal(5,3,',
        'uniform(low=0,high=5,',
        'binomial(n=100,p=.5,',
        'logistic(',
        'laplace(',
        'exponential(',
        'lognormal(',
        'poisson(',
        'poisson(lam=50,',
        'power(a=2,',
        'power(a=50,',
        'power(a=10000,',
        'wald(mean=5,scale=1,']

Then run it:


for d in dist:
    tmp = getSamp(s=d)

For uniform and binomial distributions:

For exponential and lognormal distributions:

How about a couple of different Poisson distributions?

The default Poisson yields a normal-looking distribution of means, but the highly discrete nature of the source distribution clearly leaves its mark on the distribution of the means.

But overall, it seems to hold that the distribution of the means is approximately normal. (You might want to try other distributions for yourself…)
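One more numeric check (my own addition, reusing the exponential case from the list above): the theorem also says the spread of the sample means should shrink like sigma divided by the square root of n.

rng = np.random.default_rng(0)
n, reps = 1000, 10000
means = rng.exponential(size=(n, reps)).mean(axis=0)   # exponential(1): mean 1, sd 1

print(means.mean())                   # close to 1, the true mean
print(means.std(), 1 / np.sqrt(n))    # both close to 0.0316 = sigma / sqrt(n)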

Alerting fracking companies to disclosure errors

During my regular updates of the raw FracFocus data set, I occasionally come across obvious errors.  This morning, it was a report of over 1 Billion gallons of water used in a single fracking job.  It would be great if there was a mechanism to inform companies of such errors so that they correct them.  A chemist friend tells me he has tried numerous times to communicate with companies about such problems, with no success.  I thought I would give it a try and document my attempts in my FracFocus blog.