Connecting a huge PDF library with a bulk download

As I’ve mentioned before, FracFocus does not include the chemical records from 2011 to mid-2013 in its bulk download data, even though it still provides individual PDF files from that period.  This is an important time in the fracking boom, and the data set would be considerably poorer without records from those years.  Luckily, the environmental NGO SkyTruth scraped the PDFs from the FracFocus website until FracFocus started blocking web crawling.  I use this SkyTruth archive to supplement the bulk download from FracFocus.

A lingering worry I’ve had, however, is that the disclosures from those years have been changed – with no record of the change.  That is, SkyTruth’s archive was valid when they collected the data, but companies may have silently changed those PDFs since then.  So I’ve begun a months-long project to connect the bulk data to the PDF library.

In the first stage, I am simply using the APINumbers in the SkyTruth archive to see whether the corresponding FracFocus PDFs even exist anymore.  While that seems simple enough, the FracFocus search page for the PDFs makes the process pretty tedious.  Not to mention, there are something like 35,000 separate SkyTruth disclosures to test.

The solution I’m using is a combination of the Python version of Selenium (a browser-automation tool usually used for website testing) and some scripts that slowly request the PDFs for the SkyTruth records.  This is necessary because the search page does not allow traditional web scraping – it needs a live browser clicking on buttons to fetch the appropriate PDF files.  Once the thing starts running, a Chrome browser pops up by itself, an APINumber is loaded into the appropriate field, a button click is activated, and a search is performed without any human intervention.  Nice!  My scripts keep the found PDFs (for later comparison to the bulk data) and then move on to the next APINumber.  I keep the speed low to prevent overloading the FracFocus servers (overloading was the original reason SkyTruth and others were locked out), so I get about 4 searches each minute.
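
The core of that loop is quite small.  Here is a minimal sketch; note that the URL and the element locators (‘apiNumberField’, ‘searchButton’) are placeholders – the real ones have to be read off the FracFocus search page itself:

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()                 # a visible Chrome window opens
for api in skytruth_api_numbers:            # the ~35,000 SkyTruth APINumbers
    driver.get('https://fracfocusdata.org/DisclosureSearch/')  # placeholder URL
    driver.find_element(By.ID, 'apiNumberField').send_keys(api)
    driver.find_element(By.ID, 'searchButton').click()
    # ...inspect the results page; save any PDF found for later comparison...
    time.sleep(15)                          # throttle to about 4 searches/minute
driver.quit()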

Today I finished looking through the SkyTruth PDF search data and indeed found that 140 disclosures have been removed from the FracFocus system since SkyTruth did their work.  Why were they deleted?  Maybe simple mistakes?  Seems worth looking into.

The next phase will be to compare the SkyTruth data with the actual data in the current PDFs.  That’s a big project, but I think the job is easier now than when SkyTruth was grappling with it.  I’m looking forward to trying out the Camelot project – a simple PDF table reader.
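
Camelot’s basic use is pleasantly simple.  A quick sketch (the filename here is just a stand-in):

import camelot

# detect all tables in a single disclosure PDF
tables = camelot.read_pdf('disclosure.pdf', pages='all')
print(f'{tables.n} tables found')
df = tables[0].df      # first detected table as a pandas DataFrame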

Should I overhaul Open-FF?

I am contemplating an overhaul of the Open-FF system. This would include:

  • clean up and simplify the code
  • improve the documentation
  • add unit testing, so code changes don’t introduce problems into older code
  • improve coverage of chemical mass
  • create hooks for more features, such as record correction

This will be a lot of work and take a significant amount of time.  Before I head into it, I want to outline more explicitly what FracFocus (as it is) can and can’t do for us.

Obstacles

From all of my interactions with FracFocus, I have found a host of obvious obstacles to transparency about chemicals in the industry:

  • FracFocus version 1 has no chemical records in the bulk download
  • Proprietary claims are allowed, and often buried in MSDS
  • The ‘systems approach’ breaks the connection between chemical identity and the supplier and trade-name product identity – not just for proprietary chemicals but for ALL chemicals in a disclosure.
  • There is no easy way to determine which oilfield service companies worked on a fracking job.
  • There is no audit trail: no record of when a disclosure was published or who published it.
  • There are few checks of data quality (either values or presence).
  • Silent changes are allowed.
  • There appears to be no publishing deadline – many disclosures are published more than a year after the work is completed.
  • There is no way to determine the “final” disclosure when there are duplicates.  Some may even be partial.
  • FracFocus takes no responsibility for the data entered – it always refers specific questions to the companies entering the data.
  • Companies are almost universally unresponsive about problems brought to their attention.  For the most part, there is no clear way for members of the public to alert companies.
  • The number of companies, and of different actors within companies, is very large.  It is clear there is no “one way to do it” approach to the FracFocus data, and different approaches may be contradictory.

To me, this all means that FracFocus provides companies with huge ambiguity, and therefore it will be useless for any legal challenges.  It has many built-in avenues of deniability.

There is another big source of ambiguity: my lack of direct knowledge of the industry and its procedures, and my inability to get clarification when I ask.

Uses?

Given all these problems, what good is FracFocus?

First of all, it is by far the biggest source of public data on fracking chemicals.  The last count put the number of disclosures at over 175,000 and the number of individual chemical records at over 6 million.  Even with a lot of ambiguity, with that much data larger pictures start to emerge: the vast number of chemicals used and the large ranges of quantities, sometimes at eye-popping values; patterns of water use, which continues to climb; examples of poor disclosure, of chemical-hiding techniques, of company hypocrisy; patterns of how the industry is changing; suggestions of what some of the hidden chemicals are.

As long as we remember that the data will always be less than perfect, will carry a degree of ambiguity, and will be subject to industry denials because the reporting is so sloppy, there is a lot we can get from FracFocus.

Is that enough for the potential users?  For me, that is still the open question.  Academics want data to be immaculate, and if it can’t be that, the ambiguity has to be well documented.  The little interaction I’ve had with state folks left the impression that they don’t use the data for much of anything – just documentation; as long as a disclosure is published, that box is checked.  Activists and public health advocates seem to be overwhelmed by the crazy complexity of these data: the number and complexity of chemicals, the range of uses, the multi-layered hiding.  It is frustrating to them.

Open-FF?

Given all that, what can Open-FF offer to help the situation?  I think the one major thing that Open-FF can offer (though I am not sure it is there yet) is complete transparency: opening up the FracFocus black box.  That means making sure it is easy to understand what I am doing to FF data, making sure the ambiguities of FF are well delineated and the crap of FF clearly spelled out, but also that the usefulness of the data is established.  But how can I be a source of transparency when I can’t even get my questions answered?

How do I evaluate that for Open-FF?  I guess one big step would be to move it to a clean-enough form that anyone with a Python background could take it over – so that a group like EDGI would be able to evaluate it.

Bottom line

FracFocus has always bothered me: that a “transparency instrument” manages to almost completely undermine the public’s knowledge of chemical use; that the industry actually takes credit for being transparent with it; that its existence largely scuttles serious government oversight.

The public deserves to be able to find out.  If that is my contribution, just adding a bit of transparency to the situation, I’d be happy.

FracFocus geo-projections: is it worth the trouble to standardize them?


Every FracFocus disclosure has a latitude/longitude pair with an associated “projection”; the most common are NAD27, NAD83, and WGS84.  I’ve come to understand that a given lat/lon pair, if plotted in the wrong projection, can be off by several meters, sometimes even tens to one hundred meters.  Since I would like to be able to map all the disclosures together, I started looking into converting the FracFocus lat/lons into a single standard projection.

But there were problems…

The first problem was my own ignorance of the subtleties of GIS and projections.  Because I use Python, I found the package PyProj, which is very cool and has everything needed to do the conversions.  Except that when I tried my specific conversions, say NAD27 to WGS84, nothing changed.  Other conversions worked, but not mine!  It turns out that my three targets are not actually projections but rather coordinate systems.  To make it work, I had to do two conversions: for example, first from NAD27 to a bona fide projection, such as the US National Atlas Equal Area projection, and then from that projection to WGS84.  Only then did I see differences in the converted lat/lon data.  That took me too long to figure out.

import pyproj

# coordinate systems / projection, by EPSG code
nad27 = pyproj.CRS.from_epsg(4267)       # NAD27 geographic coordinates
inbetween = pyproj.CRS.from_epsg(2163)   # US National Atlas Equal Area
wgs = pyproj.CRS.from_epsg(4326)         # WGS84 geographic coordinates

# two-step transformation: NAD27 -> equal-area projection -> WGS84
t1 = pyproj.Transformer.from_crs(nad27, inbetween, always_xy=True)
t2 = pyproj.Transformer.from_crs(inbetween, wgs, always_xy=True)

lat = 28.35134062
lon = -98.46484402
olon, olat = t1.transform(lon, lat)     # NAD27 degrees -> projected meters
xlon, xlat = t2.transform(olon, olat)   # projected meters -> WGS84 degrees

[Update: April 2022 – I’m using geopandas.  It is a lot easier than the above!  I will incorporate into version 15 of Open-FF.]
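
For comparison, here is a minimal geopandas sketch, assuming a pandas DataFrame df with Latitude and Longitude columns (hypothetical names) in NAD27:

import geopandas as gpd

# tag the points with the source coordinate system (NAD27)
gdf = gpd.GeoDataFrame(
    df, geometry=gpd.points_from_xy(df.Longitude, df.Latitude),
    crs='EPSG:4267')
gdf = gdf.to_crs('EPSG:4326')   # one call converts straight to WGS84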

But the real problems are in the FracFocus data themselves.

Understanding latitude/longitude information

It is useful to understand what the number of decimal places in a lat/lon pair can represent.  Here is a very helpful breakdown from GIS Stack Exchange user whuber:

…we can construct a table of what each digit in a decimal degree signifies:

  • The sign tells us whether we are north or south, east or west on the globe.
  • A nonzero hundreds digit tells us we’re using longitude, not latitude!
  • The tens digit gives a position to about 1,000 kilometers. It gives us useful information about what continent or ocean we are on.
  • The units digit (one decimal degree) gives a position up to 111 kilometers (60 nautical miles, about 69 miles). It can tell us roughly what large state or country we are in.
  • The first decimal place is worth up to 11.1 km: it can distinguish the position of one large city from a neighboring large city.
  • The second decimal place is worth up to 1.1 km: it can separate one village from the next.
  • The third decimal place is worth up to 110 m: it can identify a large agricultural field or institutional campus.
  • The fourth decimal place is worth up to 11 m: it can identify a parcel of land. It is comparable to the typical accuracy of an uncorrected GPS unit with no interference.
  • The fifth decimal place is worth up to 1.1 m: it can distinguish trees from each other. Accuracy to this level with commercial GPS units can only be achieved with differential correction.
  • The sixth decimal place is worth up to 0.11 m: you can use this for laying out structures in detail, for designing landscapes, building roads. It should be more than good enough for tracking movements of glaciers and rivers. This can be achieved by taking painstaking measures with GPS, such as differentially corrected GPS.
  • The seventh decimal place is worth up to 11 mm: this is good for much surveying and is near the limit of what GPS-based techniques can achieve.
  • The eighth decimal place is worth up to 1.1 mm: this is good for charting motions of tectonic plates and movements of volcanoes. Permanent, corrected, constantly-running GPS base stations might be able to achieve this level of accuracy.
  • The ninth decimal place is worth up to 110 microns: we are getting into the range of microscopy. For almost any conceivable application with earth positions, this is overkill and will be more precise than the accuracy of any surveying device.
  • Ten or more decimal places indicates a computer or calculator was used and that no attention was paid to the fact that the extra decimals are useless. Be careful, because unless you are the one reading these numbers off the device, this can indicate low quality processing!

How many digits do fracking companies report?

You might imagine that these high-tech fracking crews often use high-precision surveying equipment and would likely be measuring to the sixth decimal place.  But what do they actually report?  Here are the counts of digits to the right of the decimal point for all the latitude and longitude values in the entire FracFocus data set (downloaded Jan. 2021, includes the SkyTruth data):

Number of decimal digits    Count
         0                    447
         1                  1,033
         2                  4,751
         3                  5,900
         4                 23,088
         5                103,280
         6                236,216
         7                 41,081
         8                 17,838
         9                 10,408
        10                  1,075
        11                    458
        12                  3,190
        13                  5,456
        14                  3,160
        15                  2,949
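
Incidentally, a tally like this is easy to compute from the raw text values.  A minimal sketch, assuming the latitudes arrive as a pandas Series of strings called lats:

import pandas as pd

# count digits after the decimal point in each raw value;
# values with no decimal point count as 0
digits = lats.str.extract(r'\.(\d+)')[0].str.len().fillna(0).astype(int)
print(digits.value_counts().sort_index())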

On the low end of this distribution, almost 1/3 of the values have fewer than 6 decimal digits, probably the ideal number of digits if we are to see any benefit from translating the projections. Remarkably, more than 10,000 values have 3 or fewer digits, which is the precision recommended for reporting a location if you want to protect your privacy!

On the high end, almost 10% of the values have more than 7 digits – likely to be beyond the capability of measurement and just a sign of “low quality processing.”

Measurement consistency

Another indication of the quality of the location data is what is reported in multiple disclosures for the same well.  Over multiple fracking events, the location of the well does not move, so repeated measurements of that location should agree closely.  But real measurements are rarely identical, because there is always measurement error.  If the reports ARE identical, especially if the numbers of digits are beyond measurement capability, we can assume that the reports come from a single measurement that was simply copied.  For example, three disclosures for the well API 33-007-01818-00-00 report exactly the same lat/lon values; clearly, those data come from a single measurement.

But if the reports are not identical, we can get an idea of the measurement error.  We would like the differences to be at or below the sixth decimal digit we talked about above.
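
The comparison itself is a few lines of pandas.  A sketch, assuming a DataFrame df with one row per disclosure and columns 'api10' (a well identifier) and 'Latitude' (both column names are assumptions):

import pandas as pd

grp = df.groupby('api10')['Latitude']
spread = grp.max() - grp.min()        # within-well range of reported latitude
spread = spread[grp.count() > 1]      # keep wells with multiple disclosures

pct_identical = (spread == 0).mean() * 100                 # no difference
pct_small = ((spread > 0) & (spread < 1e-5)).mean() * 100  # below 5th decimal digit
pct_large = (spread >= 1e-5).mean() * 100                  # at/above 5th decimal digit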

Latitude differences among 4,024 wells with multiple disclosures:

  • no difference: 47%
  • difference smaller than the 5th decimal digit: 14%
  • difference greater than the 5th decimal digit: 39%

Longitude differences among 4,024 wells with multiple disclosures:

  • no difference: 47%
  • difference smaller than the 5th decimal digit: 13%
  • difference greater than the 5th decimal digit: 40%


Upshot

So the bottom line is that producing a ‘standardized’ projection for all lat/lon data in FracFocus would be chasing a mirage.  Far too many disclosures have only coarse reporting and poor consistency.  It would be like measuring the distance to the grocery store in millimeters: maybe technically possible, but not useful, and probably an illusion.

[UPDATE: There are other reasons than just accuracy in Google Map views to standardize projections.  I’m working on that for version 15.]

Collecting fracking disclosures into well pads

The FracFocus disclosure data are fundamentally based on specific fracturing events, identified by a specific job-completion date and geographic tags (latitude/longitude).  A given well – a unique location – may have more than one job applied to it, and in general the fracking-event focus of the data set gives a highly granular temporal view of the chemicals applied by this industry.  In some ways, this wealth of data (currently over 200,000 discrete disclosures, including the SkyTruth data) can overwhelm a big-picture view.  For example, methanol is used in about 60% of fracking events, at a wide range of quantities, from less than one pound to over 100,000 pounds.

One aspect of the industry that is obscured here is that many fracking operations occur on constructed “well pads,” somewhere between the size of a large home lot and about half a soccer field.  A pad may have only a single well, but more and more frequently, pads have several or even dozens of wells!

From some perspectives – for example, for folks who live near a well pad, or analysts trying to ascertain local impacts of fracking – aggregating data from all wells on a pad makes much more sense: looking at the forest instead of the trees.  Unfortunately, there is no “well-pad id” field in the FracFocus data.  So I set out to create one.

The process I use is a combination of heavy automated analysis and manual tweaking.  The first phase was to group all disclosures into tight geographic clusters, using scikit-learn’s DBSCAN – see the sketch below.  The settings I used gave me about 100,000 clusters, the majority being a single well on a pad.  This worked fairly well, but I needed a way to manually verify that I was actually catching well pads and not some other geographic grouping.  So I used the Google Staticmap API to generate satellite images with the locations of the wells overlaid on them.
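
A minimal sketch of the clustering step, assuming a DataFrame df with one row per disclosure and Latitude/Longitude columns; the 100-meter eps here is an illustrative pad-scale threshold, not necessarily the setting I ended up with:

import numpy as np
from sklearn.cluster import DBSCAN

coords = np.radians(df[['Latitude', 'Longitude']].to_numpy())
earth_radius_km = 6371.0
eps = 0.1 / earth_radius_km        # 100 m expressed in radians for haversine
db = DBSCAN(eps=eps, min_samples=1, metric='haversine',
            algorithm='ball_tree').fit(coords)
df['pad_id'] = db.labels_          # candidate well-pad identifier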

If the satellite image that Google uses is recent enough, the constructed well pad is clearly visible where the well markers show up.  Frankly, the agreement of these two data sources (Google Maps and FracFocus’s lat/lon data) was a very happy surprise.  To better understand (and maybe adjust) the clustering, for each cluster I also created an “outside” set of disclosures that were close by but not grouped in the cluster, and had the Google API show those outside wells, too – see the sketch below.
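
Generating the images is mostly a matter of assembling the Static Maps request.  A sketch (you need your own API key; marker colors distinguish cluster wells from “outside” wells):

import requests

def pad_image(cluster_pts, outside_pts, key, fname='pad.png'):
    # cluster_pts / outside_pts: lists of (lat, lon) tuples
    markers = ['color:red|'  + '|'.join(f'{lat},{lon}' for lat, lon in cluster_pts),
               'color:blue|' + '|'.join(f'{lat},{lon}' for lat, lon in outside_pts)]
    r = requests.get('https://maps.googleapis.com/maps/api/staticmap',
                     params={'size': '640x640', 'maptype': 'satellite',
                             'markers': markers, 'key': key})
    with open(fname, 'wb') as f:
        f.write(r.content)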

Instead of trying to manually look at all 100,000 images, I sorted the prospective well-pad groupings by the distance to the closest “outside” well.  Most single wells are geographically separated enough that I didn’t need to visually check them.  Whew!  I am in the process of doing this now.

Once I have these pad identifiers, I’m looking forward to seeing the pad-level quantities of chemicals used.

Surveillance of a black-box data process

In my work with FracFocus, the oil & gas industry’s chemical disclosure instrument, I have had to come up with custom methods to learn about the data, because the documentation is meager, the data organization is poor, and the maintainers of the website are not very forthcoming when I ask questions.

Even though I’ve worked with these raw data for almost two years, one aspect has remained particularly mysterious: silent changes to already-published disclosures.  These occur when the information in a published disclosure changes without any record of the change.  For example, you may look up the chemicals used in a fracking event near your home and, months later when you check again, some of the information is different from the first time you looked.  It might be changed just a bit so you don’t even notice, or one or more values might be very different, but there is no indication of when or why it changed.

It is not suspicious that a company might need to update information it has published: small mistakes can creep into reports and sometimes go unnoticed until after publication.  But a well-designed disclosure instrument should have some audit trail for changes made to published information.  Indeed, I was told by the FracFocus team that any changes to published disclosures must be made in a NEW disclosure.  In some cases, that happens in FracFocus (though, incidentally, I was also told that these new disclosures are sometimes used just to add to previous data, not to replace it, further undermining their usefulness).

But when I was evaluating an older archive of FracFocus data from the organization SkyTruth, I came across an odd situation: the data in the old archive were mostly a very accurate representation of currently published disclosures, except that some values were very different.  Clearly, the data had been changed since the old archive was created.  However, there was no audit trail, no new disclosure, no sign that the data had been changed.  I was only able to compare a few metadata values (the volume of water used and the geo-coordinates); the chemical data for those events are not available in the current publications.

One could assume that the changes simply represent companies fixing mistakes to make the newer data more correct.  However, that might be naive: just recently I came across a report from 2014 claiming that there were several silent changes in early FracFocus records to obscure the use of diesel in fracking operations.  Diesel is just about the ONLY chemical that is regulated in fracking.  Without a record of what changed, we must completely trust the companies.

So I have started comparing archives – a current one with an earlier one – to see if previously published data have been silently changed.  It is not at all clear that I will find anything of interest.  I can only look at changes since I started saving raw downloads (late 2018), and I suspect that most of the changes will be operators making minor changes to publications instead of creating a whole new disclosure.  Still, it gives me a little more confidence that we can shine more light onto the processes of this industry-sponsored operation and overcome some of its built-in weaknesses.
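
The comparison boils down to aligning two raw downloads and looking for values that differ.  A simplified sketch, using the real FracFocus field names APINumber and TotalBaseWaterVolume but glossing over the messy details (duplicate disclosures, in particular):

import pandas as pd

# align two raw downloads by well and compare a single field;
# note: missing (NaN) values will compare as "changed"
merged = old_df.merge(new_df, on='APINumber', suffixes=('_old', '_new'))
changed = merged[merged.TotalBaseWaterVolume_old !=
                 merged.TotalBaseWaterVolume_new]
print(f'{len(changed)} disclosures with changed water volumes')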

query-FF: a CodeOcean project to ease FracFocus exploration

I’ve started another CodeOcean project related to Open-FF.  Whereas Open-FF creates research-grade data sets from the FracFocus website, query-FF lets any user explore those data without being a programmer.  While it may not be as easy as a point-and-click operation, users can write small scripts, based on examples in the documentation, to build custom data sets and perform basic explorations.

Here is a simple example: a short set of commands, executed directly in the CodeOcean environment, that builds a data set of 2-butoxyethanol records for fracking events in Ohio for the years 2016-18.
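
Rendered in plain pandas, such a script might look like the sketch below, where df stands for the full chemical-records table; the column names (‘bgCAS’, ‘bgStateName’, ‘date’) follow Open-FF conventions but should be treated as assumptions, not the actual query-FF API:

# 2-butoxyethanol (CAS 111-76-2) records from Ohio, 2016-2018
subset = df[(df.bgCAS == '111-76-2') &
            (df.bgStateName == 'ohio') &
            (df.date.dt.year.isin([2016, 2017, 2018]))]
subset.to_csv('ohio_2BE_2016_18.csv')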

Furthermore, because the CodeOcean environment is a full-blown Anaconda/Python/pandas environment, users can write code directly to access the intricacies of the data without going through all the trouble of setting up their own software environment.

Hopefully, this will make the fracking chemical data accessible to many more people.

Scrubbing the black off of coal: Grappling with sub-standard data

When my kids were young, we often read “Gullible’s Troubles” by Margaret Shannon.  The protagonist of the book is a guinea pig named Gullible that believes anything he’s told.  At one point, his uncle tells him to clean the black off of a pile of coal.  Gullible dutifully scrubs and scrubs each lump and is shocked when there is nothing left.

From Gullible’s Troubles by Margaret Shannon

Sometimes when I work to clean poor quality data, I feel like Gullible: is there going to be anything worthwhile left when I am finished cleaning?

My most recent extended data project is a good example.  The FracFocus website is touted by the fracking industry as the primary resource for chemical disclosures covering a large fraction of fracking jobs in the US since 2011.  Unfortunately, it is barely more than a loose collection of disclosure forms, in multiple (barely documented) formats, with little error checking and often with missing values.

When I first discovered these disclosures, I was amazed at the breadth of the coverage: about 150,000 fracking events in most active areas of the US, and something like 4,000,000 individual chemical records.  When I began to make simple plots, it was intriguing, but there were also a lot of disclosures with missing values.  When I started to examine individual chemicals, I noticed that many were labeled incorrectly or ambiguously.  So I started writing routines to either clean up the problems or at least flag them so I could focus on the good data.

As the months passed, more and more weaknesses of the data set emerged, and I had the impression that I was correcting or flagging just about everything.  That’s when I started thinking about Gullible’s Troubles.  My original intention was to transform these disclosures into a usable research data set, but I started to wonder if I would be left with just a long list of problems.  Certainly, something like that is useful as a critique of a slipshod disclosure instrument, but could I say nothing about the big picture of fracking chemicals in the US?

The answer, fortunately, is that there is a huge amount one can learn from these data, especially when considering the big picture of chemical use.  While it may be hard to make definitive claims about specific fracking events, especially when event data are spotty or suspicious, there are so many fracking events in the data set, spread over so many operating companies and oilfield service companies, that larger patterns emerge.  For example, we may not be able to know for certain what mass of methanol (CAS RN: 67-56-1) was used in a particular fracking operation, but there are over 150,000 individual records of its use in the data set.  So, for example, we can compare the quantities that different companies use.

We find, for example, that the largest 10% of uses are over 7000 pounds and that it is not that unusual to see uses over 100,000 pounds.  Could we have learned that from any other source?

This disclosure instrument should be much cleaner than it is, and it should be far more accessible to interested parties.  Many critiques of FracFocus correctly identify weaknesses and even severe problems with it.  One could easily come away with the impression that it is a waste of time to investigate.  But it is not useless; with some work, we can pull important insights from these sub-standard data.

Sometimes it pays to be a bit gullible.

Alerting fracking companies to disclosure errors

During my regular updates of the raw FracFocus data set, I occasionally come across obvious errors.  This morning, it was a report of over 1 billion gallons of water used in a single fracking job.  It would be great if there were a mechanism to inform companies of such errors so that they could correct them.  A chemist friend tells me he has tried numerous times to communicate with companies about such problems, with no success.  I thought I would give it a try and document my attempts in my FracFocus blog.