The FracFocus disclosure data is fundamentally based on specific fracturing events, each identified by the date the job was completed and by geographic tags (latitude/longitude). A given well, a unique location, may have more than one job applied to it, and in general, the fracking-event focus of the data set gives a highly granular temporal view of the chemicals applied by this industry. In some ways, this wealth of data (currently over 200,000 discrete disclosures including the SkyTruth data) can overwhelm a big-picture view. For example, methanol is used in about 60% of fracking events at a wide range of quantities, from less than one pound to over 100,000 pounds.
One aspect of the industry that is obscured here is that many fracking operations occur on constructed “well pads,” ranging in size from a large home lot to about half of a soccer field. A pad may have only a single well, but increasingly pads host several or even dozens of wells!
From some perspectives, for example, for folks who live near a well pad or analysts trying to ascertain local impacts of fracking, aggregating data from all wells on a pad makes much more sense: looking at the forest instead of the trees. Unfortunately, there is no “well-pad id” field in the FracFocus data. So I set out to create one.
The process I use is a combination of heavy automated analysis and manual tweaking. The first phase was to group all disclosures into tight geographic clusters (using scikit-learn’s DBSCAN; see this example). The settings I used gave me about 100,000 clusters, the majority being a single well on a pad. This worked fairly well, but I needed a way to manually verify that I was actually catching well pads and not some other geographic grouping. So I used the Google Staticmap API to generate a satellite image of each cluster with the well locations overlaid on it.
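A minimal sketch of what this clustering step could look like, assuming the disclosures are available as (latitude, longitude) pairs in degrees. The 100 m `eps_m` radius, the `min_samples=1` choice, and the function name `cluster_disclosures` are illustrative assumptions, not the settings from the actual run.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, used to convert metres to radians


def cluster_disclosures(latlon_deg, eps_m=100.0):
    """Return one DBSCAN cluster label per (lat, lon) row.

    eps_m is a hypothetical neighbourhood radius in metres.
    """
    # The haversine metric expects [lat, lon] in radians.
    coords_rad = np.radians(np.asarray(latlon_deg, dtype=float))
    model = DBSCAN(
        eps=eps_m / EARTH_RADIUS_M,  # haversine distances are measured in radians
        min_samples=1,               # a lone well still forms its own cluster
        metric="haversine",
        algorithm="ball_tree",
    )
    return model.fit_predict(coords_rad)


# Made-up coordinates: three wells a few tens of metres apart, one several km away.
points = [
    (32.00000, -102.00000),
    (32.00010, -102.00010),
    (32.00020, -102.00005),
    (32.05000, -102.05000),
]
labels = cluster_disclosures(points)
```

With `min_samples=1` no point is treated as noise, so every disclosure ends up in some cluster, which matches the idea that most clusters are a single well on a pad.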
If the satellite image that Google uses is recent enough, the constructed well pad is clearly visible where the well markers show up. The agreement of these two data sources (Google Maps and FracFocus’s lat/lon data) was a very happy surprise. To better understand (and maybe adjust) the clustering, for each cluster I created an “outside” set of disclosures that were close by but not grouped into the cluster, and had the Google API show those outside wells, too.
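One way the “outside” set and the image request could be put together is sketched below. The 500 m search radius, the marker colors, the zoom level, and the helper names (`haversine_m`, `outside_wells`, `static_map_url`) are assumptions for illustration; the URL parameters (`center`, `zoom`, `size`, `maptype`, `markers`, `key`) are those of the Google Maps Static API, and a real API key is needed to fetch the image.

```python
import math
from urllib.parse import urlencode

EARTH_RADIUS_M = 6_371_000.0


def haversine_m(a, b):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def outside_wells(points, labels, cluster_id, radius_m=500.0):
    """Points outside the cluster that lie within radius_m of any member."""
    members = [p for p, lab in zip(points, labels) if lab == cluster_id]
    return [p for p, lab in zip(points, labels)
            if lab != cluster_id
            and any(haversine_m(p, m) <= radius_m for m in members)]


def static_map_url(members, outside, api_key="YOUR_API_KEY", zoom=17, size="640x640"):
    """Build a satellite-image URL: cluster wells in red, outside wells in blue."""
    lat = sum(p[0] for p in members) / len(members)
    lon = sum(p[1] for p in members) / len(members)
    fmt = lambda pts: "|".join(f"{a:.6f},{b:.6f}" for a, b in pts)
    params = [("center", f"{lat:.6f},{lon:.6f}"), ("zoom", zoom),
              ("size", size), ("maptype", "satellite"),
              ("markers", "color:red|" + fmt(members))]
    if outside:
        params.append(("markers", "color:blue|" + fmt(outside)))
    params.append(("key", api_key))
    return "https://maps.googleapis.com/maps/api/staticmap?" + urlencode(params)


# Made-up example: cluster 0 has two wells, with one neighbour about 200 m away.
pts = [(32.0, -102.0), (32.0001, -102.0001), (32.002, -102.0), (32.05, -102.05)]
labs = [0, 0, 1, 2]
members = [p for p, lab in zip(pts, labs) if lab == 0]
url = static_map_url(members, outside_wells(pts, labs, 0))
```

Using two differently colored marker groups makes it easy to see at a glance whether a nearby outside well actually sits on the same pad.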
Instead of trying to manually look at all 100,000 images, I sorted the prospective well pad groupings by the distance to the closest “outside” well, so that the clusters most likely to be mis-grouped come first. Most single wells are geographically separated enough that I don’t need to visually check them. Whew! I am in the process of working through the rest now.
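The sorting step can be sketched as follows. This is a brute-force illustration with hypothetical names (`nearest_outside_distance`, `review_order`); it compares every pair of points, so for a data set of this size a spatial index would be needed in practice.

```python
import math

EARTH_RADIUS_M = 6_371_000.0


def haversine_m(a, b):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def nearest_outside_distance(points, labels):
    """Map each cluster id to the smallest member-to-non-member distance in metres."""
    gaps = {}
    for cid in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == cid]
        others = [p for p, lab in zip(points, labels) if lab != cid]
        gaps[cid] = min(
            (haversine_m(m, o) for m in members for o in others),
            default=math.inf,  # no other wells at all
        )
    return gaps


def review_order(points, labels):
    """Cluster ids sorted so the ones with the closest outside well come first."""
    gaps = nearest_outside_distance(points, labels)
    return sorted(gaps, key=gaps.get)


# Made-up example: clusters 0 and 1 are about 200 m apart, cluster 2 is isolated.
pts = [(32.0, -102.0), (32.0001, -102.0001), (32.002, -102.0), (32.05, -102.05)]
labs = [0, 0, 1, 2]
order = review_order(pts, labs)
```

Clusters at the front of the list are the ones worth checking against the satellite image; those at the back are isolated enough to accept without a visual check.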
Once I have these pad identifiers, I’m looking forward to seeing the pad-level quantities of chemicals used.