Scrubbing the black off of coal: Grappling with sub-standard data

When my kids were young, we often read “Gullible’s Troubles” by Margaret Shannon.  The protagonist of the book is a guinea pig named Gullible who believes anything he’s told.  At one point, his uncle tells him to clean the black off of a pile of coal.  Gullible dutifully scrubs and scrubs each lump and is shocked when there is nothing left.

From Gullible’s Troubles by Margaret Shannon

Sometimes when I work to clean poor quality data, I feel like Gullible: is there going to be anything worthwhile left when I am finished cleaning?

My most recent extended data project is a good example.  The FracFocus website is touted by the fracking industry as the primary resource for chemical disclosures, covering a large fraction of fracking jobs in the US since 2011.  Unfortunately, it is hardly more than a loose collection of disclosure forms in multiple, barely documented formats, with little error checking and frequent missing values.

When I first discovered these disclosures, I was amazed at the breadth of the coverage: about 150,000 fracking events in most active areas in the US, and something like 4,000,000 individual chemical records.  When I began to make simple plots, the results were intriguing, but there were also a lot of disclosures with missing values.  When I started to examine individual chemicals, I noticed that many were labeled incorrectly or ambiguously.  So, I started writing routines to either clean up the problems or at least flag them so I could focus on the good data.
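To give a sense of what a "flag it rather than fix it" routine can look like, here is a minimal sketch.  It is not the code used for this project.  The field names (`cas_number`, `mass_lbs`) and the sample records are made up for illustration; the only real-world rule it relies on is the standard CAS Registry Number check digit, which catches many mistyped identifiers.

```python
import re

# CAS numbers look like 67-56-1: 2-7 digits, 2 digits, 1 check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def cas_is_valid(cas):
    """Return True if `cas` is well formed and its check digit matches."""
    match = CAS_PATTERN.match(cas.strip())
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(int(d) * w for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def flag_record(record):
    """Return a list of problem flags for one record (empty if none found)."""
    flags = []
    cas = record.get("cas_number")  # hypothetical field name
    if not cas:
        flags.append("missing_cas")
    elif not cas_is_valid(cas):
        flags.append("invalid_cas")
    if record.get("mass_lbs") is None:  # hypothetical field name
        flags.append("missing_mass")
    return flags


# Made-up records, for illustration only.
records = [
    {"cas_number": "67-56-1", "mass_lbs": 5200.0},  # methanol, complete
    {"cas_number": "67-56-2", "mass_lbs": 310.0},   # check digit is wrong
    {"cas_number": "", "mass_lbs": None},           # nothing usable
]
for r in records:
    r["flags"] = flag_record(r)

# Keep everything, but analyze only the records with no flags.
usable = [r for r in records if not r["flags"]]
```

Flagging instead of deleting keeps the problem records around, so the list of problems can be reported alongside whatever the clean records show.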

As months passed by, more and more weaknesses of the data set emerged and I had the impression that I was correcting or flagging just about everything.  That’s when I started thinking about Gullible’s Troubles.  My original intention was to transform these disclosures into a usable research data set, but I started to wonder if I would be left with just a long list of problems.  Certainly, something like that is useful as a critique of a slipshod disclosure instrument, but could I say nothing about the big picture of fracking chemicals in the US?

The answer, fortunately, is that there is a huge amount one can learn from these data, especially when considering the big picture of chemical use.  While it may be hard to make definitive claims about specific fracking events, especially when event data are spotty or suspicious, there are so many fracking events in the data set, spread over so many operating companies and oilfield service companies, that larger patterns emerge.  For example, we may not be able to know for certain what mass of methanol (CAS RN: 67-56-1) was used for a particular fracking operation, but there are over 150,000 individual records of its use in the data set.  That is enough to compare the quantities that different companies use.

We find, for example, that the largest 10% of uses are over 7000 pounds and that it is not that unusual to see uses over 100,000 pounds.  Could we have learned that from any other source?
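A summary like this comes down to grouping the records for one chemical and reading off percentiles.  The sketch below shows the idea with the standard library only.  The company names and masses are invented for illustration and are not FracFocus values, and the nearest-rank percentile is just one reasonable definition among several.

```python
import math
from collections import defaultdict

METHANOL_CAS = "67-56-1"


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least
    pct percent of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Invented (company, CAS number, mass in pounds) records, illustration only.
records = [
    ("Operator A", "67-56-1", 150.0),
    ("Operator A", "67-56-1", 900.0),
    ("Operator A", "67-56-1", 4200.0),
    ("Operator A", "67-56-1", 7600.0),
    ("Operator B", "67-56-1", 300.0),
    ("Operator B", "67-56-1", 2500.0),
    ("Operator B", "67-56-1", 120000.0),
    ("Operator B", "7732-18-5", 9000000.0),  # water, ignored below
]

# Collect the methanol masses reported by each company.
by_company = defaultdict(list)
for company, cas, mass_lbs in records:
    if cas == METHANOL_CAS:
        by_company[company].append(mass_lbs)

# Median and 90th percentile of methanol mass per company.
summary = {
    company: (percentile(masses, 50), percentile(masses, 90))
    for company, masses in by_company.items()
}
for company, (median, p90) in sorted(summary.items()):
    print(f"{company}: median {median:,.0f} lb, 90th percentile {p90:,.0f} lb")
```

Percentiles are a good fit for data like these because a handful of suspicious or mistyped masses moves them far less than it would move an average.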

This disclosure instrument should be much cleaner than it is, and it should be far more accessible to interested parties.  Many critiques of FracFocus correctly identify weaknesses and even severe problems with it.  One could easily come away with the impression that it is a waste of time to investigate.  But it is not useless; with some work, we can pull important insights from these sub-standard data.

Sometimes it pays to be a bit gullible.