query-FF: a CodeOcean project to ease FracFocus exploration

I’ve started another CodeOcean project related to open-FF. Whereas open-FF creates research-grade data sets from the FracFocus website, query-FF lets any user explore those data without being a programmer. While it may not be as easy as a point-and-click operation, users can write small scripts, based on examples in the documentation, to build custom data sets and to perform basic explorations.

Here is a simple example:

This set of commands, executed directly in the CodeOcean environment, makes a data set for 2-butoxy-ethanol for fracking events in Ohio for the years 2016-18.
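For readers who prefer to see it spelled out, a rough pandas equivalent of that filter might look like the sketch below. The file name and column names (CASNumber, StateName, date) are assumptions about an exported data table, not the actual query-FF commands; 111-76-2 is the CAS number for 2-butoxy-ethanol.

import pandas as pd

df = pd.read_csv('ff_chemical_records.csv', low_memory=False)   # hypothetical export of the data set
df['date'] = pd.to_datetime(df['date'])

subset = df[(df.CASNumber == '111-76-2') &                      # 2-butoxy-ethanol
            (df.StateName.str.lower() == 'ohio') &
            (df.date.dt.year.between(2016, 2018))]

subset.to_csv('butoxyethanol_OH_2016_18.csv', index=False)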

Furthermore, because the CodeOcean environment is a full-blown anaconda/python/pandas environment, the user can write code directly to access the intricacies of the data without going through all the trouble of setting up their own software environment.

Hopefully, this will make the fracking chemical data accessible to many more people.

Scrubbing the black off of coal: Grappling with sub-standard data

When my kids were young, we often read “Gullible’s Troubles” by Margaret Shannon.  The protagonist of the book is a guinea pig named Gullible that believes anything he’s told.  At one point, his uncle tells him to clean the black off of a pile of coal.  Gullible dutifully scrubs and scrubs each lump and is shocked when there is nothing left.

From Gullible’s Troubles by Margaret Shannon

Sometimes when I work to clean poor quality data, I feel like Gullible: is there going to be anything worthwhile left when I am finished cleaning?

My most recent extended data project is a good example.  The FracFocus website is touted by the fracking industry as the primary resource of chemical disclosures for a large fraction of fracking jobs in the US since 2011.  Unfortunately, it is little more than a loose collection of disclosure forms in multiple, barely documented formats, with little error checking and many missing values.

When I first discovered these disclosures, I was amazed at the breadth of the coverage: about 150,000 fracking events in most active areas in the US, something like 4,000,000 individual chemical records.   When I began to make simple plots, it was intriguing, but there were also a lot of disclosures with missing values.  When I started to examine individual chemicals, I noticed that there were many that were labeled incorrectly or ambiguously.  So, I started writing routines to either clean up the problems or at least flag them so I could focus on the good data.

As months passed by, more and more weaknesses of the data set emerged and I had the impression that I was correcting or flagging just about everything.  That’s when I started thinking about Gullible’s Troubles.  My original intention was to transform these disclosures into a usable research data set, but I started to wonder if I would be left with just a long list of problems.  Certainly, something like that is useful as a critique of a slipshod disclosure instrument, but could I say nothing about the big picture of fracking chemicals in the US?

The answer, fortunately, is that there is a huge amount one can learn from these data, especially when considering the big picture of chemical use.  While it may be hard to make definitive claims about specific fracking events, especially when event data are spotty or suspicious, there are so many fracking events in the data set, spread over so many operating companies and oilfield service companies, that larger patterns emerge.  For example, we may not be able to know for certain what mass of methanol (CAS RN: 67-56-1) was used in a particular fracking operation, but there are over 150,000 individual records of its use in the data set.  That is enough to compare the quantities that different companies use:

We find, for example, that the largest 10% of uses are over 7000 pounds and that it is not that unusual to see uses over 100,000 pounds.  Could we have learned that from any other source?
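A comparison like that takes only a few lines of pandas.  In this sketch the file name and column names (bgCAS for the curated CAS number, bgSupplier for the curated company name, calcMass for the calculated mass in pounds) are stand-ins for whatever the cleaned data set actually uses:

import pandas as pd

df = pd.read_csv('cleaned_records.csv', low_memory=False)      # hypothetical cleaned data set
meth = df[(df.bgCAS == '67-56-1') & (df.calcMass > 0)]         # methanol records with a usable mass

print(meth.calcMass.quantile([0.5, 0.9, 0.99]))                # sizes of typical and large uses
print(meth.groupby('bgSupplier').calcMass.median()             # typical use, by company
          .sort_values(ascending=False).head(10))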

This disclosure instrument should be much cleaner than it is and it should be far more accessible to interested parties.  Many critiques of FracFocus correctly identify weaknesses and even severe problems with it.  One could easily come away with the impression that it is a waste of time to investigate.  But, it is not useless; with some work, we can pull important insights from these sub-standard data.

Sometimes it pays to be a bit gullible.

First-occurrence of COVID cases and deaths by US county

When the New York Times made its compilation of covid cases available, I decided to take a look.  Although it seems like EVERYONE is doing covid analysis, there were a few things that I wasn’t seeing.  For instance, the US is a huge country and I imagine that the sense of threat from covid is going to be related to how close the virus is geographically to a given person.

At the state level, at the beginning of March, only about 10 states were showing cases, but within a week or so, almost all states registered cases.  By the beginning of April, just about all states registered a covid death:

Because the NYT data are reported down to the county level, we can get better geographic resolution.  At this level, we see that there are still quite a few counties that haven’t registered a case, and fewer than half have seen a covid death (as of May 6, 2020).

How about looking at the number of people in those counties?  For that, I pulled in the 2010 Census numbers for a quick comparison (I removed territories for this analysis).  The total population for 2010 is 309,585,169.  Here we see that just about all people are in counties with cases and about 85% of people are in counties that have registered deaths – many of those ‘spared’ counties have very few people.
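The gist of that comparison fits in a few lines.  This is a condensed sketch, not the notebook itself: the NYT file does have date, county, state, fips, cases, and deaths columns, while the census file and its fips/population columns here are stand-ins for whatever 2010 Census extract you use.

import pandas as pd

nyt = pd.read_csv('https://raw.githubusercontent.com/nytimes/covid-19-data/'
                  'master/us-counties.csv', parse_dates=['date'])
pop = pd.read_csv('census_2010_county_pop.csv')                 # hypothetical extract: fips, population

latest = nyt[nyt.date == nyt.date.max()]                        # cumulative counts on the last reported day
merged = pop.merge(latest, on='fips', how='left').fillna({'cases': 0, 'deaths': 0})

total = merged.population.sum()
print('share of people in counties with a case :',
      merged.loc[merged.cases > 0, 'population'].sum() / total)
print('share of people in counties with a death:',
      merged.loc[merged.deaths > 0, 'population'].sum() / total)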


And here are the days of first cases and deaths:

(Here is a version of the Jupyter notebook I used for these analyses.)

How my coding background helped me understand complex analysis

I use many statistical tools, but I am not a statistician.  Although I took several stats courses in college and graduate school, I’m not familiar enough with statistical theory to understand what a complex analysis will do with a given data set beyond simple situations.

It’s a problem.  In my field, ecologists are primarily interested in biological systems and the complex interactions of species and their environment.  Experiments that address interesting problems can get complicated very quickly. In graduate school, I saw many ecology students apply very complicated analyses to their complex experimental designs, often because they knew it would be expected by their committee and journal reviewers.   You might say that “a sufficient understanding of the analysis was often lacking.”

I’m speaking from personal experience, of course. (There are many mathematically agile ecologists, including a solid bunch in my grad program.)  In my main experiments, I had to manipulate the diversity of species in a natural community and then follow how the community responded to different kinds of stresses.  At the time it was a novel experiment for fieldwork and I was absorbed by how to accomplish it.  I didn’t carefully think about the statistical design until the data started rolling in. (Big Mistake, but I’m sure I’m not alone.)

I got some advice on how to design the analysis, but I struggled to understand how to interpret its results.  “Struggled” isn’t quite the word; I was sure I was a failure.  If I couldn’t understand my analysis, how could I possibly defend my work?

But then I took the “Community Analysis” course by Bruce McCune.  Bruce is one of those math-savvy ecologists who also has a knack for explaining how things work to the math-not-so-savvy.  The content of the course was multivariate analysis with an ecological focus.  But the big take-home lesson for me was:  explore your analysis tools however you can until you understand what they will do for you.   For me, that meant simulating all kinds of data, running that through the technique we were learning, and watching what it spit out.  That was something I could do!  (Thank you, Dr. McCune!)

This led me to a series of tests of not only my analysis, but also the kinds of experimental designs ecologists were beginning to use to address questions about the function of diversity.  It was fascinating to me because it showed how easy it was to fool ourselves into thinking we were seeing one thing in a particular experiment, when it could easily be something else, hidden by our choice of experimental design and analysis.

A few years after graduating from Oregon State, I inherited a large survey for which the analysis was intrinsically built into the sample design – the way it should be.  It was a complex analysis, the data were messy, and it was a massive effort for the field crew.  I was excited to be part of the project because of its scale and its promise to reveal broad patterns.  But the analysis was difficult for me to wrap my head around – a design to estimate the dominant spatial scale of variability using a nested ANOVA.   So I did what I knew how to do: I created simulated data sets with different scales of variability and ran them through the analysis scheme.   And sure enough, variability at one level was unexpectedly showing up at other levels in the results, often to a surprising degree.  When I posted my results to a forum of stats experts (R-help), I received validation: sure, without lots of replication, you would expect that to happen.  These complex analyses won’t work miracles with low-power data. (Thank you, Dr. Lumley!)  The data we were collecting were valuable for lots of reasons, but probably not for identifying the dominant scale of variability.
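To make that concrete, here is a stripped-down version of the kind of simulation I mean – a made-up two-level design, not the actual survey analysis.  All of the true variability is placed at the site level, yet with only a few sites per region it reliably shows up as apparent region-level variability:

import numpy as np

rng = np.random.default_rng(0)
n_regions, n_sites = 8, 4             # low replication, typical of field surveys

apparent_region_var = []
for _ in range(2000):
    # true region effects are zero; all of the variability is site-to-site
    sites = rng.normal(0.0, 1.0, size=(n_regions, n_sites))
    region_means = sites.mean(axis=1)
    apparent_region_var.append(region_means.var(ddof=1))

print(np.mean(apparent_region_var))   # about 0.25 (= 1/n_sites), not 0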

I have found that simulating expected patterns and then testing an analysis scheme against those simulated data has helped me gain an almost intuitive understanding of what an analysis is really telling me.  As someone who depends on complicated analyses as a tool, that kind of exploration has become invaluable.


How many cuts of a deck of cards does it take before the deck is shuffled? Maybe the answer will surprise you.

During the current covid-19 “stay-at-home” order, my family has been playing a lot of cards.  Inevitably, someone will complain that the dealer hasn’t shuffled enough or is over-shuffling.  And that leads to questions like “how can you even know if you have shuffled enough?”

Sounds like a job for simulations!  Let’s start with some very simple models:

If we assume that, instead of the typical deck of Ace, 2, 3, …, Q, K in four different suits, the cards here are just numbered from 0 to 51, we simplify the simulation problem.  Thus, the deck, when “new,” will be consecutively ordered from 0 to 51.

Furthermore, a simple measure of its likeness to the “new” state is the accumulated difference between consecutive cards:

Our “new” deck is defined as a simple list:

def ndeck():
    return list(range(52))

And looks like:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 
10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 
20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 
30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 
40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 
50, 51]

A shuffled deck is a new deck that is scrambled with random.shuffle from Python’s standard random module:

import random

def rdeck():
    d = ndeck()
    random.shuffle(d)   # shuffle the list in place
    return d

… and will look something like:

[19, 45, 44, 40, 32, 33, 27, 37, 14, 10, 
34, 5, 23, 8, 17, 21, 48, 6, 47, 24, 
38, 15, 46, 29, 22, 12, 36, 42, 0, 41, 
3, 2, 13, 11, 51, 30, 9, 28, 20, 49, 
31, 16, 4, 50, 43, 39, 25, 1, 18, 7, 
26, 35]

We will calculate scores of our decks with:

def deck_score(deck):
    accum = 0
    for i, card in enumerate(deck):
        if i < len(deck) - 1:   # don't go past the final card
            accum += abs(card - deck[i+1])
    return accum

What scores do a new and a shuffled deck yield?


The new deck score is 51.

Shuffled decks (scrambled with the standard Python module, random) give us a range of scores, and the mean shuffled score is about 900.
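If you want to check those numbers yourself, here is a quick sketch using the functions defined above (the exact mean will wobble a little from run to run):

import statistics

print(deck_score(ndeck()))                            # 51 for a new deck
scores = [deck_score(rdeck()) for _ in range(10000)]  # many freshly shuffled decks
print(statistics.mean(scores))                        # roughly 900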



Cutting the deck

Let’s start with a simple form of shuffling:  just cutting the deck.  This is the simple process of separating the deck into two piles of random thickness and then switching top and bottom halves.  A function to do that for us:

def cut_deck(deck,minThin=4):  # minThin controls how thin the piles can be
    lower = minThin # the lower and upper limits to where the cut can be in the deck
    upper = len(deck)-minThin
    cut = random.randint(lower,upper)  # location of the cut
    new = deck[cut:] + deck[:cut]  # bottom on top, top on the bottom
    return new

And a once-cut new deck looks something like:

[10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 
20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 
30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 
40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 
50, 51, 0, 1, 2, 3, 4, 5, 6, 7, 
8, 9]

Pretty simple, eh?


A Null Hypothesis

A good place to start exploring is to guess what you are going to see before you start simulating.  In this case, my first hypothesis is that as we cut the same deck over and over, its deck score will climb toward the mean score of a shuffled deck (about 900), and that it will take about 12 cuts to make a random-looking deck.  Why 12?  Just a guess.

I think it will look something like:



Consecutive cuts

If a new deck gives us a score of 51, and a deck cut once yields 101, what does cutting a deck twice yield?

151?

l = []
d = ndeck()
l.append(deck_score(d))
for i in range(10):
    d = cut_deck(d)
    #print(d)
    l.append(deck_score(d))
print(l)

And the result of the consecutive cuts?

[51, 101, 101, 101, 101, 101, 
101, 101, 101, 101, 101]

Huh?!  Let’s look at a deck after a million cuts:

d = ndeck()
for i in range(1000000):
    d = cut_deck(d)
print(d)

The result:

[39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 
49, 50, 51, 0, 1, 2, 3, 4, 5, 6, 
7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 
27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 
37, 38]

Just a single cut!


Thinking a little more carefully reveals what’s happening: the second cut actually reconnects the two cards that were separated by the first cut!

This was a surprise to me.  Consecutive cutting doesn’t change things much at all: each cut is just a rotation of the deck, so the only thing that moves is the single “seam” where the highest and lowest cards meet, and the score stays stuck at 101 (50 steps of 1 plus the one jump of 51).  Each new cut effectively “heals” the previous one.

Thinking about it now, it makes sense.  But it goes against my long-held belief that cutting multiple times results in a shuffled deck.  I admit that I had to pull out a real deck of cards to convince myself that this simulation result wasn’t an artifact.  It is not.  Try it out.

Was this a surprise to you?  Just about everyone in my house guessed that the deck would be pretty shuffled after a bunch of cuts. Probably a thinking error that magicians exploit.

Digging into Random Forests

More than a year ago, I worked through a few Fast.ai courses.

It was a fantastic experience: an introduction to the nuts and bolts of applying machine intensive analysis to real problems in a way that one could see results almost immediately; a code-based approach (it helped that I already knew lots of Python) that got into the nitty-gritty of solving tough problems; an emphasis on working habits that led to an agile approach; and an amazingly active and collaborative community of students.  Somehow, whatever Jeremy Howard says always makes sense and is steeped in hard-won expertise.  It was a perfect fit for me and I learned a ton.

But honestly, I have not applied the Fast.ai material to my own projects much at all.  I think the simple reason is that there is an inertia I have to overcome when I start a Fast.ai type analysis: securing enough GPU computing power, making sure all the modules are up-to-date and aligned to the project at hand, relearning the processes that Jeremy outlined and adapting them to the project.  While I am sure that many Fast.ai students are plugged in enough to make those obstacles relatively trivial, my available time and resources just aren’t a good match to make all that work.

But I’ve realized that there should be a lot I can do towards implementing a machine learning approach without needing the heavy infrastructure, especially since the problems I’m likely to tackle aren’t necessarily the massive ones targeted in the courses.  Jeremy’s approach covered tools students need to tackle some of the biggest problems – but that brought lots of complexity.  I just need to pare it down to the basics –  Fast.ai on the cheap – and create a learning environment for myself in which I can make some progress.

The projects I have in mind right now would be best served by an exploration of Random Forest methodologies.  That’s where I will start…
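For a sense of scale, the pared-down starting point I have in mind is something like the scikit-learn sketch below – synthetic tabular data, no GPU, nothing from the Fast.ai library itself:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 10))                                  # made-up tabular features
y = X[:, 0] * 2.0 + np.sin(X[:, 1]) + rng.normal(scale=0.3, size=5000)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
rf = RandomForestRegressor(n_estimators=200, oob_score=True, n_jobs=-1,
                           random_state=0).fit(X_train, y_train)

print('OOB R^2 :', rf.oob_score_)                                # out-of-bag estimate
print('test R^2:', rf.score(X_test, y_test))
print('feature importances:', rf.feature_importances_.round(2))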


An empirical exploration of the Central Limit Theorem

The way I understand it, the Central Limit Theorem states that a distribution of sampled means of ANY distribution will be normal.

Well, sort of.

From Wikipedia:

In probability theory, the central limit theorem (CLT) establishes that, in some situations, when independent random variables are added, their properly normalized sum tends toward a normal distribution (informally a “bell curve”) even if the original variables themselves are not normally distributed. The theorem is a key concept in probability theory because it implies that probabilistic and statistical methods that work for normal distributions can be applicable to many problems involving other types of distributions.

And in another place:

The central limit theorem states that the sum of a number of independent and identically distributed random variables with finite variances will tend to a normal distribution as the number of variables grows.
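In symbols, the classical i.i.d. form of that statement (standard notation, not taken from either quote) is:

\[
\sqrt{n}\,\bigl(\bar{X}_n - \mu\bigr) \;\xrightarrow{d}\; \mathcal{N}\bigl(0,\sigma^{2}\bigr)
\quad \text{as } n \to \infty,
\qquad \bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i ,
\]

where the X_i are independent and identically distributed with mean μ and finite variance σ².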

In general, I only start to understand such concepts if I can play around with them.  So let’s see how this works! We will start by defining a function that will perform the sampling to create distributions of means.

In the code below, the argument s in the function getSamp is a string fragment naming the numpy.random distribution (and its parameters) that we will be testing.


#preamble 
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')

import pandas as pd
import numpy as np

nbins = 100

def getSamp(s='normal(5,3,', n=1000, reps=10000):
    # s is a string fragment naming the numpy.random distribution and its
    # parameters; eval() completes the call by appending the size argument.
    ddist = eval('np.random.' + s + 'size=n)')          # one raw sample of the data
    xb = eval('np.random.' + s + 'size=(n,reps))')      # reps samples, each of size n
    f, (ax1, ax2) = plt.subplots(ncols=2, figsize=(10, 5))
    ax1.set_title('Data distribution')
    ax1.hist(ddist, bins=nbins)
    ax2.set_title('Sample distribution of the mean for: ' + s + ')')
    mb = xb.mean(axis=0)                                # one mean per replicate sample
    return ax2.hist(mb, bins=nbins)                     # histogram arrays; the caller stashes them in tmp

tmp = getSamp()

For the default (normal distribution, mean = 5, std. dev = 3), here’s the distribution and the results distribution of means:

Let’s define a list of distributions that we want to run against.


dist = ['normal(5,3,',
        'uniform(low=0,high=5,',
        'binomial(n=100,p=.5,',
        'logistic(',
        'laplace(',
        'exponential(',
        'lognormal(',
        'poisson(',
        'poisson(lam=50,',
        'power(a=2,',
        'power(a=50,',
        'power(a=10000,',
        'wald(mean=5,scale=1,']

Then run it:


for d in dist:
    tmp = getSamp(s=d)

For uniform and binomial distributions:

For exponential and lognormal distributions:

How about a couple of different poisson distributions?

That default poisson yields a normal-looking mean distribution, but clearly the highly discrete nature of the source distribution has an effect on the mean’s distribution.

But overall, it seems to hold that the distributions of the means are close to normal. (You might want to try other distributions for yourself…)
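For example, to add a gamma distribution to the mix, you can pass getSamp another string fragment in the same format (the shape parameter here is just an arbitrary choice):

tmp = getSamp(s='gamma(shape=2,')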

Alerting fracking companies to disclosure errors

During my regular updates of the raw FracFocus data set, I occasionally come across obvious errors.  This morning, it was a report of over 1 billion gallons of water used in a single fracking job.  It would be great if there were a mechanism to inform companies of such errors so that they could correct them.  A chemist friend tells me he has tried numerous times to communicate with companies about such problems, with no success.  I thought I would give it a try and document my attempts in my FracFocus blog.