Dealing with changes in open-source packages

Now that the Open-FF Browser is live and used by many, I’ve run into an odd difficulty.  I have based much of the “informal” search capability of the browser on a table-based element called “iTables,” a Python wrapper around the JavaScript library “DataTables.”  I have found both extremely useful; they make search within a static web space very effective.

However, iTables recently added a lot of functionality (some of which I use) and is now at version 2.0.  In the process of this change, the Search function has become less responsive: when the displayed table is larger than about 10,000 records, Search grinds to a halt, or at least a very slow crawl.

Most of the browser tables are shorter than that, but for some, such as Texas or even Weld County, Colorado, the search function is useless or, worse, crashes the browser.

It turns out that earlier versions of iTables were similarly limited (as expected), but the maximum table size I could search was larger than the largest browser tables.  No longer.

This is forcing me to think about new ways of providing search functionality to users.  I believe I may have to move some searches to Colab notebooks, or perhaps even look into BigQuery-type interfaces.  The Browser has always been based on a static website model, for simplicity’s sake, and I’m not sure which new path to take.

Examining past disclosures in FracFocus

As I have mentioned in this blog, FracFocus does not maintain an audit trail as companies change already-published disclosures.  In fact, such changes are made without public notice or justification.

One way to see what those changes are would be to compare older archived data with more recent data.  While FracFocus only supplies bulk data for the current data state, we have been saving these bulk downloads since Fall 2018 and currently have around 250 separate downloads.  We are working on a resource that researchers can use to delve into that older data.

[Figure: number of disclosures in the set of archived downloads]

The two obvious research questions for these older data are: 1) what silent changes are happening in the data, and 2) what is the publication delay between the end of the fracking job and the disclosure’s appearance in FracFocus?  This latter question is interesting because many states require publication within 30 or 60 days, and yet there are many cases where the delay is more like a few years.
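Since FracFocus records no publication date of its own, the archive can stand in for one: the first download in which a disclosure appears brackets when it was published.  Here is a minimal sketch of that idea, assuming the archived metadata have been combined into a single DataFrame called archive with a download_date column (building such an index is sketched just below) and that the ID and date column names match the bulk-download format:

import pandas as pd

# Approximate each disclosure's publication date by the first
# bulk download in which it appears.
first_seen = archive.groupby('DisclosureId')['download_date'].min()

job_end = (archive.drop_duplicates('DisclosureId')
                  .set_index('DisclosureId')['JobEndDate'])
delay_days = (first_seen - pd.to_datetime(job_end)).dt.days

# How many disclosures appeared more than a year after the job ended?
print((delay_days > 365).sum())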

In the FracFocus bulk download, there are two types of zip files. The first type splits the entire data set into roughly 25 zipped CSV files, which include all the chemical records.  The second type contains only the metadata: one line per disclosure.  To create a useful “index” to all the archived disclosures, we use this latter set, which gives us a much easier way to compare the separate downloads.  Although the entire archive is too large to store online, this index will be available.
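A minimal sketch of how such an index might be assembled, assuming one metadata zip per archived download; the directory layout and file names here are hypothetical:

import glob
import pandas as pd

frames = []
for path in sorted(glob.glob('archive/*/metadata.zip')):  # hypothetical layout
    df = pd.read_csv(path, low_memory=False)  # pandas reads a zipped CSV directly
    # tag every row with the date of the bulk download it came from
    df['download_date'] = pd.to_datetime(path.split('/')[1])
    frames.append(df)

archive = pd.concat(frames, ignore_index=True)
archive.to_parquet('archive_index.parquet')  # a compact, shareable index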


Connecting a huge PDF library with a bulk download

As I’ve mentioned before, FracFocus does not include the chemical records from 2011 to mid-2013 in its bulk download data, even though it still provides individual PDF files from that period.  This is an important time in the fracking boom, and the data set would be considerably poorer without records from those years.  Luckily, the environmental NGO SkyTruth scraped the PDFs from the FracFocus website until FracFocus started blocking web crawling.  I use this SkyTruth archive to supplement the bulk download from FracFocus.

A lingering worry I’ve had, however, is that the disclosures from those years may have been changed, with no record of the change.  That is, SkyTruth’s archive was valid when the data were collected, but companies may have silently changed those PDFs since.  So I’ve begun a months-long project to connect the bulk data to the PDF library.

In the first stage, I am simply using the APINumbers of the SkyTruth archive to see if the FracFocus PDFs still even exist.  While that seems simple enough, the FracFocus search page for the PDFs makes the process pretty tedious.  Not to mention, there are something like 35,000 separate SkyTruth disclosures to test.

The solution I’m using is a combination of the Python version of Selenium (a website-testing framework that drives a live browser) and some scripts that slowly request the PDFs for the SkyTruth records.  This is necessary because the search page does not allow traditional web scraping; it needs a live browser clicking on buttons to fetch the appropriate PDF files.  Once the thing starts running, a Chrome browser pops up by itself, an APINumber is loaded into the appropriate field, a button click is activated, and a search is performed without any human intervention.  Nice!  My scripts keep the found PDFs (for later comparison to the bulk data) and then move on to the next APINumber.  I keep the speed low to prevent overloading the FracFocus servers (the original reason SkyTruth and others were locked out), so I get about four searches each minute.
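In rough outline, the script looks something like the following.  The URL and element IDs are placeholders (I’m not reproducing the real page internals here); the loop structure and the throttling are the point:

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # a real Chrome window, driven by the script

for api_number in skytruth_api_numbers:  # the ~35,000 SkyTruth IDs
    driver.get('https://fracfocusdata.org/DisclosureSearch/')  # placeholder URL
    field = driver.find_element(By.ID, 'apiNumber')             # placeholder ID
    field.clear()
    field.send_keys(api_number)
    driver.find_element(By.ID, 'searchButton').click()          # placeholder ID
    # ...check the results page, save any PDF that comes back...
    time.sleep(15)  # ~4 searches per minute, to go easy on the servers

driver.quit()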

Today I finished looking through the SkyTruth PDF search data and found that 140 disclosures have been removed from the FracFocus system since SkyTruth did its work.  Why were they deleted?  Simple mistakes, maybe?  Seems worth looking into.

The next phase will be to compare the SkyTruth data with the actual data in the current PDFs.  That’s a big project, but I think the job is easier now than when SkyTruth was grappling with it.  I’m looking forward to trying the Camelot project, a Python library for reading tables out of PDFs.
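Camelot’s interface is pleasantly small.  A sketch of what the extraction step might look like (the file name is hypothetical, and whether it copes with these particular PDFs remains to be seen):

import camelot

# Camelot pulls tables out of text-based PDFs as pandas DataFrames
tables = camelot.read_pdf('skytruth_disclosure.pdf', pages='all')
for table in tables:
    print(table.df.head())  # each extracted table is a DataFrame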

What is the Open-FF curation cost?

My work to clean up FracFocus requires curation. By that, I mean manually looking at data and making decisions about how to deal with it based on its values.  Data analysts have to do that all the time: is this value an outlier? should I transform the data? Should I drop these values because they seem to be not collected in the same way as the rest?

The work on FracFocus is different in at least two ways.  First, I am not making decisions for a particular analysis goal, but rather for the goal of producing a “clean set” for unspecified future analyses.  And second, the data set is constantly growing, so I have to either 1) make decisions in a way that can be reused on future (unknown) data, or 2) re-curate the entire data set every time it is updated.  I’m currently working on how to deal with CAS registry numbers, and that is a good example of both issues.

Simply put, CAS numbers are codes that uniquely identify chemical compounds, whereas the “names” we give chemicals are rife with ambiguity (for instance, there are dozens of ways of naming 144-55-8: baking soda).  FracFocus wisely tells end-users to pay attention to the CAS number and not the name.  But the FracFocus CAS data are messy too, so we must either throw out data that don’t conform or try to correct them.

So for the current data set (March 2021), even though there are only about 1,400 actual unique chemicals, there are about 3,900 unique values in the CASNumber field that have to be curated; that is, we must decide on our best guess for the “real” CAS number behind each raw value.  For a large proportion of those 3,900, the answer is simple: either they match an authoritative CAS number exactly, or they are clearly not CAS numbers at all.  For those, the curation is easy, and any new data added in the future won’t change what I decide for the current values.  But some values require cleanup before they match a reference CAS number, and some require an interpretation that is data dependent.  The CAS number for water is ‘7732-18-5’ and there are scads of those in the data.  However, there are also some ‘7332-18-5’ versions that, after examining individual records, are clearly also water.  Do we make all future versions of 7332-18-5 into water?  I wouldn’t feel good about that.  However, if we also had the IngredientName, and it was ‘water’ for all those future versions, I would be comfortable making that call.
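One mechanical aid here: CAS numbers carry a built-in check digit, so badly malformed values can be screened automatically.  Notably, though, the checksum doesn’t settle the case above: ‘7332-18-5’ happens to pass it too, which is exactly why context like IngredientName matters.  A sketch of the standard validation:

import re

def valid_cas(cas: str) -> bool:
    """Check the CAS format and verify its check digit."""
    if not re.fullmatch(r'\d{2,7}-\d{2}-\d', cas):
        return False
    digits = cas.replace('-', '')
    body, check = digits[:-1], int(digits[-1])
    # weight the body digits 1, 2, 3, ... from the rightmost digit leftward
    total = sum(i * int(d) for i, d in enumerate(reversed(body), start=1))
    return total % 10 == check

print(valid_cas('7732-18-5'))  # True: water
print(valid_cas('144-55-8'))   # True: baking soda
print(valid_cas('7332-18-5'))  # True! the typo still passes the checksum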

But throwing the second variable into the curation process adds a huge curation cost: we go from 3,900 unique CASNumber values to 33,000 unique CASNumber/IngredientName pairs. Whew!  That’s a lot of labor, though we can probably cut it down by focusing only on the CAS numbers that are ambiguous.

OK, let’s say that the first curation task is manageable; even though it might take a week of work, if you only have to do it once, that’s doable.  But there is also the question of how much ongoing work a CASNumber/IngredientName evaluation would take.  FracFocus is an ever-expanding entity and we need to grapple with the future work, too.

Currently, about 50 new combinations appear each week.  In 2019, the rate was higher, and it is likely to return to at least that level.  Still, that seems manageable, if there are tools in place that make the regular curation task simple, especially if I can make the default to reject the record and make the rules very clear.  I am willing to give it a try; falling back to the simpler curation task (just using CASNumber) is relatively easy.
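A sketch of the weekly workflow I have in mind, with hypothetical file names: keep a table of already-curated pairs, and flag only the newcomers in each fresh download.

import pandas as pd

curated = pd.read_csv('curated_pairs.csv')  # past decisions (hypothetical file)
fresh = pd.read_csv('new_download.csv', low_memory=False)

pairs = fresh[['CASNumber', 'IngredientName']].drop_duplicates()
status = pairs.merge(curated, on=['CASNumber', 'IngredientName'],
                     how='left', indicator=True)
todo = status[status['_merge'] == 'left_only']  # only new pairs need review
print(len(todo), 'new CASNumber/IngredientName pairs to curate')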

Should I overhaul Open-FF?

I am contemplating an overhaul of the Open-FF system. This would include:

  • clean up and simplify the code
  • improve the documentation
  • add unit testing, so code changes don’t introduce problems into older code
  • improve coverage of chemical mass
  • create hooks for more features, such as record correction

This will be lots of work and a significant amount of time.  Before I head into it, I want to outline more explicitly what FracFocus (as it is) can and can’t do for us.

Obstacles

From all of my interactions with FracFocus, I have found a host of obvious obstacles to transparency about chemicals in the industry:

  • FracFocus version 1 has no chemical records in the bulk download.
  • Proprietary claims are allowed, and are often buried in MSDS documents.
  • The ‘systems approach’ breaks the connection between chemical identity and supplier/trade-name product identity, not just for proprietary chemicals but for ALL chemicals in a disclosure.
  • There is no easy way to determine which oilfield service companies worked on a fracking job.
  • There is no audit trail: no record of when a disclosure was published or who published it.
  • There are few checks of data quality (either values or their presence).
  • Silent changes are allowed.
  • There appears to be no publishing deadline; many disclosures are published more than a year after the work is completed.
  • There is no way to determine the “final” disclosure when there are duplicates.  Some may even be partial.
  • FracFocus takes no responsibility for the data entered; it always refers specific questions to the companies entering the data.
  • Companies are almost universally unresponsive about problems brought to their attention.  For the most part, there is no clear way for members of the public to alert companies.
  • The number of companies, and of different actors within companies, is very large.  There is clearly no “one way to do it” in the FracFocus data, and different approaches may be contradictory.

To me, this all means that FracFocus affords companies huge ambiguity, and therefore it will be useless for any legal challenge.  It has many built-in avenues of deniability.

There is another big ambiguity: my lack of direct knowledge of the industry and its procedures, and my inability to ask for clarification.

Uses?

Given all these problems, what good is FracFocus?

First of all, it is by far the biggest source of public data on fracking chemicals.  The last count put the number of disclosures over 175,000 and the number of individual chemical records over 6 million.  Even with a lot of ambiguity, with that much data, larger pictures start to emerge: pictures of the vast number of chemicals used and the large ranges of quantities, sometimes at eye-popping values; patterns of water use, which continues to climb; examples of poor disclosure, of chemical-hiding techniques, of company hypocrisy; patterns of how the industry is changing; suggestions of what some of the hidden chemicals are.

As long as we remember that the data will always be less than perfect, will carry a degree of ambiguity, and will be subject to industry denials because the reporting is so sloppy, there is a lot we can get from FracFocus.

Is that enough for the potential users?  For me, that is still the open question.  Academics want data to be immaculate, and if they can’t be, the ambiguity has to be well documented.  The little interaction I’ve had with state folks left the impression that they don’t use the data for much of anything beyond documentation: as long as a disclosure is published, that box is checked.  Activists and public health advocates seem to be overwhelmed by the crazy complexity of these data: the number and complexity of chemicals, the range of uses, the multi-layered hiding.  It is frustrating to them.

Open-FF?

Given all that, what can Open-FF offer to help the situation?  I think the one major thing Open-FF can offer (though I am not sure it is there yet) is complete transparency: opening up the FracFocus black box.  That means making sure it is easy to understand what I am doing to FF data, making sure the ambiguities of FF are well delineated, the crap of FF clearly spelled out, but also the usefulness of the data established.  But how can I be a source of transparency when I can’t even get questions answered?

How do I evaluate that for Open-FF?  I guess one big step would be to move it to a clean-enough form that anyone with a Python background could take it over, so that a group like EDGI would be able to evaluate it.

Bottom line

FracFocus has always bothered me: that a “transparency instrument” manages to almost completely undermine the public’s knowledge of chemical use; that the industry actually takes credit for being transparent with it; that its existence largely scuttles serious government oversight.

The public deserves to be able to find out.  If that is my contribution, just adding a bit of transparency to the situation, I’d be happy.

Simple addition to a Jupyter table

A colleague recently commented that a report I generated would be far more user-friendly if the tables were interactive.  Sigh.  Of course, he is right.  When people come across big tables, most will just scroll right past, even if they are interested in the content.  It is just too much trouble to digest.

Based on his suggestion, I went searching for HTML or JavaScript tools that might help me do something like that.  What I stumbled upon has been super useful not just for reports, but for just about every Jupyter script I write.

It is the itables module for Python and pandas.  By adding just a few lines of code, every dataframe you display becomes interactive: column names can be clicked for sorting, tables are broken into pages, and an awesome search bar makes filtering on the fly a snap.
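The setup really is just a few lines; something like this (the exact call may differ between itables versions, and the toy dataframe is only for illustration):

import pandas as pd
from itables import init_notebook_mode

# make every DataFrame displayed in this notebook interactive
init_notebook_mode(all_interactive=True)

df = pd.DataFrame({'chemical': ['methanol', 'water'],
                   'records': [120_000, 80_000]})  # toy data
df  # renders with sorting, paging, and a search bar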

FracFocus geo-projections: is it worth the trouble to standardize them?


Every FracFocus disclosure has a latitude/longitude pair with an associated “projection”; the most common are NAD27, NAD83, and WGS84.  I’ve come to understand that a given lat/lon pair, if plotted in the wrong projection, can be off by several meters, sometimes even tens to one hundred meters.  Since I would like to be able to map all the disclosures together, I started looking into converting the FracFocus lat/lons into a single standard projection.

But there were problems…

The first problem was my own ignorance of the subtleties of GIS and projections.  Because I use Python, I found the pyproj package, which is very cool and has everything needed to do the conversions.  Except that when I tried my specific conversions, say NAD27 to WGS84, nothing changed.  Other conversions worked, but not mine!  It turns out that my three targets are not actually projections but rather coordinate systems.  To make it work, I had to do two conversions: first from NAD27 to a bona fide projection, such as the US National Atlas Equal Area; then from that projection to WGS84.  Only then did I see differences in the converted lat/lon data.  That took me far too long to figure out.

import pyproj

# Define the three coordinate reference systems by EPSG code
nad27 = pyproj.CRS.from_epsg(4267)      # NAD27 geographic coordinates
inbetween = pyproj.CRS.from_epsg(2163)  # US National Atlas Equal Area (a true projection)
wgs = pyproj.CRS.from_epsg(4326)        # WGS84 geographic coordinates

# Two-step transform: NAD27 -> projected CRS -> WGS84
t1 = pyproj.Transformer.from_crs(nad27, inbetween, always_xy=True)
t2 = pyproj.Transformer.from_crs(inbetween, wgs, always_xy=True)

lat = 28.35134062
lon = -98.46484402
x, y = t1.transform(lon, lat)    # NAD27 degrees -> projected meters
xlon, xlat = t2.transform(x, y)  # projected meters -> WGS84 degrees

[Update: April 2022 – I’m now using geopandas.  It is a lot easier than the above!  I will incorporate it into version 15 of Open-FF.]
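For comparison, a sketch of the geopandas route, assuming a DataFrame df with Latitude/Longitude columns reported in NAD27:

import geopandas as gpd

# build point geometries from the lat/lon columns, tagged as NAD27
gdf = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df['Longitude'], df['Latitude']),
    crs='EPSG:4267')

gdf = gdf.to_crs('EPSG:4326')  # reproject to WGS84 in one step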

But the real problems are in the FracFocus data themselves.

Understanding latitude/longitude information

It is useful to understand what the number of decimal places in a lat/lon pair can represent.  Here is a very helpful breakdown from GIS Stack Exchange user whuber:

…we can construct a table of what each digit in a decimal degree signifies:

  • The sign tells us whether we are north or south, east or west on the globe.
  • A nonzero hundreds digit tells us we’re using longitude, not latitude!
  • The tens digit gives a position to about 1,000 kilometers. It gives us useful information about what continent or ocean we are on.
  • The units digit (one decimal degree) gives a position up to 111 kilometers (60 nautical miles, about 69 miles). It can tell us roughly what large state or country we are in.
  • The first decimal place is worth up to 11.1 km: it can distinguish the position of one large city from a neighboring large city.
  • The second decimal place is worth up to 1.1 km: it can separate one village from the next.
  • The third decimal place is worth up to 110 m: it can identify a large agricultural field or institutional campus.
  • The fourth decimal place is worth up to 11 m: it can identify a parcel of land. It is comparable to the typical accuracy of an uncorrected GPS unit with no interference.
  • The fifth decimal place is worth up to 1.1 m: it can distinguish trees from each other. Accuracy to this level with commercial GPS units can only be achieved with differential correction.
  • The sixth decimal place is worth up to 0.11 m: you can use this for laying out structures in detail, for designing landscapes, building roads. It should be more than good enough for tracking movements of glaciers and rivers. This can be achieved by taking painstaking measures with GPS, such as differentially corrected GPS.
  • The seventh decimal place is worth up to 11 mm: this is good for much surveying and is near the limit of what GPS-based techniques can achieve.
  • The eighth decimal place is worth up to 1.1 mm: this is good for charting motions of tectonic plates and movements of volcanoes. Permanent, corrected, constantly-running GPS base stations might be able to achieve this level of accuracy.
  • The ninth decimal place is worth up to 110 microns: we are getting into the range of microscopy. For almost any conceivable application with earth positions, this is overkill and will be more precise than the accuracy of any surveying device.
  • Ten or more decimal places indicates a computer or calculator was used and that no attention was paid to the fact that the extra decimals are useless. Be careful, because unless you are the one reading these numbers off the device, this can indicate low quality processing!

How many digits do fracking companies report?

You might imagine that these high-tech fracking crews often use high-tech surveying equipment and likely measure to the sixth decimal place.  But what do they actually report?  Here are the numbers of digits to the right of the decimal point for all the latitude and longitude values in the entire FracFocus data set (downloaded Jan. 2021, including the SkyTruth data):

Decimal digits     Count
      0              447
      1            1,033
      2            4,751
      3            5,900
      4           23,088
      5          103,280
      6          236,216
      7           41,081
      8           17,838
      9           10,408
     10            1,075
     11              458
     12            3,190
     13            5,456
     14            3,160
     15            2,949

On the low end of this distribution, almost a third of the values have fewer than 6 decimal digits, probably the minimum needed to see any benefit from translating the projections. Remarkably, more than 10,000 values have 3 or fewer digits, which is the precision recommended when you want to protect a location’s privacy!

On the high end, almost 10% of the values have more than 7 digits – likely to be beyond the capability of measurement and just a sign of “low quality processing.”
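Counting those digits is straightforward, provided the values are read as raw strings before any float conversion rounds them.  A sketch, with hypothetical file and column names:

import pandas as pd

def decimal_digits(value: str) -> int:
    """Count the digits to the right of the decimal point."""
    return len(value.split('.')[1]) if '.' in value else 0

# read lat/lon as text so the reported digits survive intact
raw = pd.read_csv('fracfocus.csv', dtype={'Latitude': str, 'Longitude': str})
counts = pd.concat([raw['Latitude'], raw['Longitude']]).map(decimal_digits)
print(counts.value_counts().sort_index())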

Measurement consistency

Another indication of the quality of the location data is what is reported in multiple disclosures for the same well.  Over multiple fracking events, the location of the well does not move, so multiple measurements of that location should agree.  But of course, real measurements are rarely identical, because there is always measurement error.  If the reports ARE identical, especially when the number of digits is beyond measurement capability, we can assume that the reports come from a single measurement that was simply copied.  For example, in three disclosures for the well API 33-007-01818-00-00, the lat/lon values clearly come from a single measurement.

But if they are not identical, we can get an idea of the measurement error.  We would like the differences to be at or below the 6 decimal digits we talked about above.

Latitude differences among 4,024 wells with multiple disclosures:
  - no difference: 47%
  - difference smaller than the 5th decimal digit: 14%
  - difference larger than the 5th decimal digit: 39%

Longitude differences among 4,024 wells with multiple disclosures:
  - no difference: 47%
  - difference smaller than the 5th decimal digit: 13%
  - difference larger than the 5th decimal digit: 40%
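A sketch of how those comparisons can be made, grouping disclosures by well and measuring the spread of the reported coordinates (DataFrame and column names assumed):

import pandas as pd

# wells with more than one disclosure
multi = df[df.duplicated('APINumber', keep=False)]

# spread of reported coordinates within each well
spread = multi.groupby('APINumber')[['Latitude', 'Longitude']].agg(
    lambda s: s.max() - s.min())

for col in ['Latitude', 'Longitude']:
    same = (spread[col] == 0).mean()
    small = ((spread[col] > 0) & (spread[col] < 1e-5)).mean()
    print(f'{col}: {same:.0%} identical, {small:.0%} differ below the 5th digit')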


Upshot

So the bottom line is that producing a ‘standardized’ projection for all lat/lon data in FracFocus would be a mirage.  Far too many disclosures have only coarse precision and poor consistency.  It would be like measuring the distance to the grocery store in millimeters: maybe technically possible, but not useful, and probably an illusion.

[UPDATE: There are reasons other than accuracy in Google Map views to standardize projections.  I’m working on that for version 15.]

Critically Endangered: Pycnopodia helianthoides

G. Allison

I just learned the sad news that the sunflower sea star, Pycnopodia helianthoides, has been listed as critically endangered by the IUCN. A massive survey led by Oregon State University researchers has found that the population has collapsed and is showing little sign of recovery.

During my research stint in Oregon in the 1990s, it was a special treat to come across these spectacular animals.  They weren’t uncommon, but they inhabit the lowest zone of the intertidal, so they were rarely exposed when I was out there.  They are voracious predators, often targeting sea urchins.

In 2013, the species, along with many other sea star species, was devastated by an epidemic, the Sea Star Wasting Syndrome.  While most other species of stars have begun to rebound, P. helianthoides has not.


Collecting fracking disclosures into well pads

The FracFocus disclosure data is fundamentally based on specific fracturing events, identified by the date the job was completed and by geographic tags (latitude/longitude).  A given well, a unique location, may have more than one job applied to it, and in general the fracking-event focus of the data set gives a highly granular temporal view of the chemicals applied by this industry.  In some ways, this wealth of data (currently over 200,000 discrete disclosures, including the SkyTruth data) can overwhelm a big-picture view.  For example, methanol is used in about 60% of fracking events, at a wide range of quantities, from less than one pound to over 100,000 pounds.

One aspect of the industry that is obscured here is that many fracking operations occur on constructed “well pads,” somewhere between the size of a large home lot and about half of a soccer field.  A pad may have only a single well, but often, and increasingly frequently, it may have several or even dozens of wells!

From some perspectives (for example, folks who live near a well pad, or analysts trying to ascertain local impacts of fracking), aggregating data from all wells on a pad makes much more sense: looking at the forest instead of the trees.  Unfortunately, there is no “well-pad id” field in the FracFocus data.  So I set out to create one.

The process I use is a combination of heavy automated analysis and manual tweaking. Simply put, the first phase was to group all disclosures into tight geographic clusters (using scikit-learn’s DBSCAN – see the sketch below).  The settings I used gave me about 100,000 clusters, the majority being a single well on a pad.  This worked fairly well, but I needed a way to manually verify that I was actually catching well pads and not some other geographic grouping.  So I used the Google Staticmap API to generate satellite images with the locations of the wells overlaid on them.
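A sketch of the clustering step, using haversine (great-circle) distance so that eps can be set in meters; the 100 m radius is a placeholder for whatever settings I actually tuned:

import numpy as np
from sklearn.cluster import DBSCAN

coords = np.radians(df[['Latitude', 'Longitude']].to_numpy())  # haversine wants radians
eps_meters = 100  # placeholder clustering radius
db = DBSCAN(eps=eps_meters / 6_371_000,  # meters -> radians on Earth's sphere
            min_samples=1,               # every disclosure lands in some cluster
            metric='haversine',
            algorithm='ball_tree').fit(coords)
df['pad_id'] = db.labels_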

If the satellite image that Google uses is recent enough, the constructed well pad is clearly visible where the well markers show up.  Actually, the agreement of these two data sources (Google Maps and FracFocus’s lat/lon data) was a very happy surprise.  To better understand (and maybe adjust) the clustering, I created for each cluster an “outside” set of disclosures that were close by but not grouped in the cluster, and had the Google API show those outside wells, too (the request format is sketched below).
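The image requests are just URLs; something like the following, where the zoom level and marker colors are illustrative and a valid API key is required:

BASE = 'https://maps.googleapis.com/maps/api/staticmap'

def pad_image_url(cluster_pts, outside_pts, key):
    """Satellite image with cluster wells in red, nearby 'outside' wells in blue."""
    inside = 'markers=color:red|' + '|'.join(
        f'{lat:.5f},{lon:.5f}' for lat, lon in cluster_pts)
    outside = 'markers=color:blue|' + '|'.join(
        f'{lat:.5f},{lon:.5f}' for lat, lon in outside_pts)
    return f'{BASE}?maptype=satellite&size=640x640&zoom=17&{inside}&{outside}&key={key}'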

Instead of trying to manually look at all 100,000 images, I sorted the prospective well pad groupings by the distance to the closest “outside” well.  Most single wells are geographically separated enough that I didn’t need to visually check them. Whew! I am in the process of doing this now.

Once I have these pad identifiers, I’m looking forward to seeing the pad-level quantities of chemicals used.

Surveillance of a black-box data process

In my work with FracFocus, the oil & gas industry’s chemical disclosure instrument, I have had to come up with custom methods to learn about the data, because the documentation is meager, the data organization is poor, and the maintainers of the website are not very forthcoming when I ask questions.

Even though I’ve worked with these raw data for almost two years, one aspect has remained particularly mysterious: silent changes to already published disclosures.  This occurs when the information in a published disclosure changes without any record of the change.  For example, you may look up the chemicals used in a fracking event near your home and months later, when you check again, some of the information is different than the first time you looked.  It might be changed just a bit so you don’t even notice, or one or more values might be very different, but there is no indication when or why it changed.

There is nothing suspicious about a company needing to update information it has published: small mistakes can creep into reports and sometimes go unnoticed until after publication.  But a well-designed disclosure instrument should have some audit trail for changes made to published information. Indeed, I was told by the FracFocus team that any changes to published disclosures must be made in a NEW disclosure.  In some cases, that happens in FracFocus (though, incidentally, I was also told that these new disclosures are sometimes used just to add to previous data, not to replace it, further undermining their usefulness).

But when I was evaluating an older archive of FracFocus data from the organization SkyTruth, I came across an odd situation: the data in the old archive was mostly a very accurate representation of currently published disclosures, except that some values were very different. Clearly, the data had been changed since the old archive was created.  However, there was no audit trail, no new disclosure, no sign that the data had been changed.  I was only able to compare a few metadata values (the volume of water used and the geo-coordinates); the chemical data for those events are not available in the current publications.

One could assume that the changes simply represent companies fixing mistakes, making the newer data more correct.  However, that might be naive: just recently I came across a report from 2014 claiming that there were several silent changes in early FracFocus records to obscure the use of diesel in fracking operations.  Diesel is just about the ONLY chemical that is regulated in fracking.  Without a record of what changed, we must completely trust the companies.

So, I have started comparing archives, a current one with an earlier one, to see whether previously published data have been silently changed.  It is not at all clear that I will find anything of interest.  I can only look at changes since I started saving raw downloads (late 2018), and I suspect that most of the changes will be operators making minor edits to publications instead of creating a whole new disclosure.  Still, this work gives me a little more confidence that we can shine more light onto the processes of this industry-sponsored operation and overcome some of its built-in weaknesses.  A sketch of the comparison follows.
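A minimal sketch of the comparison, assuming two saved metadata downloads and the bulk-download column names; the join key may need refinement for wells with multiple disclosures:

import pandas as pd

old = pd.read_csv('download_2018.csv', low_memory=False)
new = pd.read_csv('download_current.csv', low_memory=False)

merged = old.merge(new, on=['APINumber', 'JobEndDate'],
                   suffixes=('_old', '_new'))
changed = merged[merged['TotalBaseWaterVolume_old']
                 != merged['TotalBaseWaterVolume_new']]
print(len(changed), 'disclosures whose water volume changed silently')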