FracFocus geo-projections: is it worth the trouble to standardize them?

 

Every FracFocus disclosure has a latitude/longitude pair with an associated “projection”; the most common are NAD27, NAD83, and WGS84.  I’ve come to understand that a given lat/lon pair, if plotted in the wrong projection, can be off by several meters, sometimes even tens of meters up to a hundred meters.  Since I would like to be able to map all the disclosures together, I started looking into converting the FracFocus lat/lons into a single standard projection.

But there were problems…

The first problem was my own ignorance of the subtleties of GIS and projections.  Because I use Python, I found the PyProj package, which is very cool and has everything needed to do the conversions.  Except that when I tried my specific conversions, say NAD27 to WGS84, nothing changed.  Other conversions worked, but not mine!  It turns out that my three targets are not actually projections but rather geographic coordinate systems.  To make it work, I had to do two conversions: first from NAD27 to a bona fide projection, such as the US National Atlas Equal Area projection, and then from that projection to WGS84.  Only then did I see differences in the converted lat/lon data.  That took me far too long to figure out.

import pyproj

# Geographic coordinate systems plus an intermediate projected CRS
nad27 = pyproj.CRS.from_epsg(4267)       # NAD27 geographic
inbetween = pyproj.CRS.from_epsg(2163)   # US National Atlas Equal Area (projected)
wgs = pyproj.CRS.from_epsg(4326)         # WGS84 geographic

# Two-step transformation: NAD27 -> projected CRS -> WGS84
t1 = pyproj.Transformer.from_crs(nad27, inbetween, always_xy=True)
t2 = pyproj.Transformer.from_crs(inbetween, wgs, always_xy=True)

lat = 28.35134062
lon = -98.46484402
olon, olat = t1.transform(lon, lat)      # NAD27 degrees -> projected meters
xlon, xlat = t2.transform(olon, olat)    # projected meters -> WGS84 degrees

[Update: April 2022 – I’m using geopandas.  It is a lot easier than the above!  I will incorporate into version 15 of Open-FF.]
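As a taste of that simpler route, here is a minimal geopandas sketch (the DataFrame and its column names are made up for illustration; they are not the actual FracFocus schema):

import geopandas as gpd
import pandas as pd

# Hypothetical input: coordinates reported in NAD27 (EPSG:4267)
df = pd.DataFrame({'Latitude': [28.35134062], 'Longitude': [-98.46484402]})

# Tag the points with their source CRS...
gdf = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df.Longitude, df.Latitude),
    crs='EPSG:4267')

# ...and reproject to WGS84 (EPSG:4326) in a single call
gdf_wgs84 = gdf.to_crs('EPSG:4326')
print(gdf_wgs84.geometry.x.iloc[0], gdf_wgs84.geometry.y.iloc[0])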

But the real problems are in the FracFocus data themselves.

Understanding latitude/longitude information

It is useful to understand what the number of decimal places in a lat/lon pair can represent.  Here is a very helpful breakdown from GIS Stack Exchange user whuber:

…we can construct a table of what each digit in a decimal degree signifies:

  • The sign tells us whether we are north or south, east or west on the globe.
  • A nonzero hundreds digit tells us we’re using longitude, not latitude!
  • The tens digit gives a position to about 1,000 kilometers. It gives us useful information about what continent or ocean we are on.
  • The units digit (one decimal degree) gives a position up to 111 kilometers (60 nautical miles, about 69 miles). It can tell us roughly what large state or country we are in.
  • The first decimal place is worth up to 11.1 km: it can distinguish the position of one large city from a neighboring large city.
  • The second decimal place is worth up to 1.1 km: it can separate one village from the next.
  • The third decimal place is worth up to 110 m: it can identify a large agricultural field or institutional campus.
  • The fourth decimal place is worth up to 11 m: it can identify a parcel of land. It is comparable to the typical accuracy of an uncorrected GPS unit with no interference.
  • The fifth decimal place is worth up to 1.1 m: it can distinguish trees from each other. Accuracy to this level with commercial GPS units can only be achieved with differential correction.
  • The sixth decimal place is worth up to 0.11 m: you can use this for laying out structures in detail, for designing landscapes, building roads. It should be more than good enough for tracking movements of glaciers and rivers. This can be achieved by taking painstaking measures with GPS, such as differentially corrected GPS.
  • The seventh decimal place is worth up to 11 mm: this is good for much surveying and is near the limit of what GPS-based techniques can achieve.
  • The eighth decimal place is worth up to 1.1 mm: this is good for charting motions of tectonic plates and movements of volcanoes. Permanent, corrected, constantly-running GPS base stations might be able to achieve this level of accuracy.
  • The ninth decimal place is worth up to 110 microns: we are getting into the range of microscopy. For almost any conceivable application with earth positions, this is overkill and will be more precise than the accuracy of any surveying device.
  • Ten or more decimal places indicates a computer or calculator was used and that no attention was paid to the fact that the extra decimals are useless. Be careful, because unless you are the one reading these numbers off the device, this can indicate low quality processing!
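To attach rough numbers to that list: one degree of latitude is about 111 km, so each additional decimal digit is roughly a factor-of-ten improvement in precision. A quick back-of-the-envelope check (latitude only; a degree of longitude shrinks toward the poles):

# Approximate ground precision of each decimal digit of latitude
# (1 degree of latitude is roughly 111.32 km; longitude scales by cos(latitude))
METERS_PER_DEGREE_LAT = 111_320
for digits in range(9):
    print(f'{digits} decimal digits ≈ {METERS_PER_DEGREE_LAT / 10**digits:g} m')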

How many digits do fracking companies report?

You might imagine that these high-tech fracking crews will often be using high-tech surveying equipment and likely be measuring to the sixth decimal place.  But what do they actually report?  Here are the numbers of digits to the right of the decimal point for all the latitude and longitude values in the entire FracFocus data set (downloaded Jan. 2021, includes the SkyTruth data):

Decimal digits    Count
 0                  447
 1                 1033
 2                 4751
 3                 5900
 4                23088
 5               103280
 6               236216
 7                41081
 8                17838
 9                10408
10                 1075
11                  458
12                 3190
13                 5456
14                 3160
15                 2949

On the low end of this distribution, almost one-third of the disclosures have fewer than 6 decimal digits – probably the minimum precision needed to see any benefit from converting between coordinate systems. Remarkably, more than 10,000 values have 3 or fewer digits, which is the precision recommended if you want to protect your privacy!

On the high end, almost 10% of the values have more than 7 digits – likely to be beyond the capability of measurement and just a sign of “low quality processing.”
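For reference, here is roughly how a count like the one in the table above can be produced with pandas. Two assumptions here: the values are read as strings (so trailing digits aren’t mangled by float conversion), and the column is a toy stand-in, not the actual FracFocus field.

import pandas as pd

def decimal_digits(value):
    # Count digits to the right of the decimal point in the raw string value
    s = str(value)
    return len(s.split('.')[1]) if '.' in s else 0

# Toy stand-in for the real data; in practice, read the registry CSV with dtype=str
lats = pd.Series(['28.351', '28.35134062', '28.351340620000001', '28'])
print(lats.apply(decimal_digits).value_counts().sort_index())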

Measurement consistency

Another indication of the quality of the location data is what is reported in multiple disclosures for the same well.  Over multiple fracking events, the location of the well does not move, so we should expect repeated measurements of that location to agree.  Of course, real measurements are rarely identical, because there is always measurement error.  If the reports ARE identical, especially when the number of digits is beyond measurement capability, we can assume that the reports come from a single measurement that was simply copied.  For example, the following table shows three disclosures for the well API 33-007-01818-00-00; clearly, the lat/lon data come from a single measurement:

But if they are not identical, we can get an idea of the measurement error.  We would like the differences to be at or below the sixth decimal digit discussed above.

Latitude difference among 4,024 wells with multiple disclosures:
  • No difference: 47%
  • Difference smaller than the 5th decimal digit: 14%
  • Difference greater than the 5th decimal digit: 39%

Longitude difference among 4,024 wells with multiple disclosures:
  • No difference: 47%
  • Difference smaller than the 5th decimal digit: 13%
  • Difference greater than the 5th decimal digit: 40%
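Here is a sketch of how such a consistency check can be assembled with pandas. The table and column names are placeholders; the real Open-FF code may differ.

import pandas as pd

# Toy stand-in for the disclosures table: one row per disclosure
df = pd.DataFrame({'APINumber': ['well-A', 'well-A', 'well-A', 'well-B', 'well-B'],
                   'Latitude':  [47.580960, 47.580960, 47.580960, 28.351341, 28.357900]})

# Keep only wells with more than one disclosure
counts = df['APINumber'].value_counts()
multi = df[df['APINumber'].isin(counts[counts > 1].index)]

# Spread (max - min) of reported latitude for each such well
spread = multi.groupby('APINumber')['Latitude'].agg(lambda s: s.max() - s.min())

# Bucket the spreads: identical, below the 5th decimal place, or above it
labels = ['no difference', 'smaller than 5th decimal', 'greater than 5th decimal']
print(pd.cut(spread, [-1, 0, 1e-5, 360], labels=labels).value_counts(normalize=True))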

 

Upshot

So the bottom line is that producing a ‘standardized’ projection for all lat/lon data in FracFocus would be a mirage.  Far too many disclosures are reported too coarsely and too inconsistently for the conversion to matter.  It would be like measuring the distance to the grocery store in millimeters: maybe technically possible, but not useful and probably an illusion.

[UPDATE: There are other reasons than just accuracy in Google Map views to standardize projections.  I’m working on that for version 15.]

Digging into Random Forests

More than a year ago, I worked through a few Fast.ai courses.

It was a fantastic experience: an introduction to the nuts and bolts of applying machine intensive analysis to real problems in a way that one could see results almost immediately; a code-based approach (it helped that I already knew lots of Python) that got into the nitty-gritty of solving tough problems; an emphasis on working habits that led to an agile approach; and an amazingly active and collaborative community of students.  Somehow, whatever Jeremy Howard says always makes sense and is steeped in hard-won expertise.  It was a perfect fit for me and I learned a ton.

But honestly, I have not applied the Fast.ai material to my own projects much at all.  I think the simple reason is that there is an inertia I have to overcome when I start a Fast.ai type analysis: securing enough GPU computing power, making sure all the modules are up-to-date and aligned to the project at hand, relearning the processes that Jeremy outlined and adapting them to the project.  While I am sure that many Fast.ai students are plugged in enough to make those obstacles relatively trivial, my available time and resources just aren’t a good match to make all that work.

But I’ve realized that there should be a lot I can do towards implementing a machine learning approach without needing the heavy infrastructure, especially since the problems I’m likely to tackle aren’t necessarily the massive ones targeted in the courses.  Jeremy’s approach covered tools students need to tackle some of the biggest problems – but that brought lots of complexity.  I just need to pare it down to the basics –  Fast.ai on the cheap – and create a learning environment for myself in which I can make some progress.

The projects I have in mind right now would be best served by an exploration of Random Forest methodologies.  That’s where I will start…
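As a concrete illustration of what “on the cheap” can look like (this is just a generic scikit-learn sketch on synthetic data, not one of my Open-FF projects), a random forest needs no GPU at all:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data, purely as a placeholder for a real tabular problem
X, y = make_regression(n_samples=2000, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A CPU-only random forest; fits in seconds on a laptop
rf = RandomForestRegressor(n_estimators=200, n_jobs=-1, random_state=0)
rf.fit(X_train, y_train)
print('test R^2:', rf.score(X_test, y_test))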

 

An empirical exploration of the Central Limit Theorem

The way I understand it, the Central Limit Theorem states that the distribution of sample means drawn from ANY distribution will be normal.

Well, sort of.

From Wikipedia:

In probability theory, the central limit theorem (CLT) establishes that, in some situations, when independent random variables are added, their properly normalized sum tends toward a normal distribution (informally a “bell curve”) even if the original variables themselves are not normally distributed. The theorem is a key concept in probability theory because it implies that probabilistic and statistical methods that work for normal distributions can be applicable to many problems involving other types of distributions.

And in another place:

The central limit theorem states that the sum of a number of independent and identically distributed random variables with finite variances will tend to a normal distribution as the number of variables grows.
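Put slightly more formally (my own paraphrase of the classical i.i.d. version): if $X_1, \ldots, X_n$ are independent draws from any distribution with mean $\mu$ and finite variance $\sigma^2$, and $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ is the sample mean, then

$$\sqrt{n}\,(\bar{X}_n - \mu) \;\xrightarrow{d}\; \mathcal{N}(0, \sigma^2) \quad \text{as } n \to \infty,$$

so for large $n$ the sample mean is approximately normal with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$, regardless of the shape of the original distribution.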

In general, I only start to understand such concepts if I can play around with them.  So let’s see how this works! We will start by defining a function that will perform the sampling to create distributions of means.

In the code below, the argument s of the function getSamp is a string naming the numpy.random distribution (and its parameters) that we will be testing; it gets spliced into an eval call to draw the samples.


#preamble
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')

import pandas as pd
import numpy as np

nbins = 100

def getSamp(s='normal(5,3,', n=1000, reps=10000):
    # s is a fragment like 'normal(5,3,' that gets spliced into an eval call,
    # e.g. eval("np.random.normal(5,3,size=n)")
    # One sample of size n, just to show the shape of the source distribution
    ddist = eval('np.random.' + s + 'size=n)')
    # reps samples of size n, drawn all at once: shape (n, reps)
    xb = eval('np.random.' + s + 'size=(n,reps))')
    f, (ax1, ax2) = plt.subplots(ncols=2, figsize=(10, 5))
    ax1.set_title('Data distribution')
    ax1.hist(ddist, bins=nbins)
    ax2.set_title('Sample distribution of the mean for: ' + s + ')')
    mb = xb.mean(axis=0)  # one mean per replicate: reps means of size-n samples
    # return the histogram output so that assigning it (tmp = ...) suppresses notebook echo
    return ax2.hist(mb, bins=nbins)

tmp = getSamp()

For the default (a normal distribution with mean = 5 and std. dev. = 3), here are the data distribution and the resulting distribution of sample means:

Let’s define a list of distributions that we want to run against.


dist = ['normal(5,3,',
        'uniform(low=0,high=5,',
        'binomial(n=100,p=.5,',
        'logistic(',
        'laplace(',
        'exponential(',
        'lognormal(',
        'poisson(',
        'poisson(lam=50,',
        'power(a=2,',
        'power(a=50,',
        'power(a=10000,',
        'wald(mean=5,scale=1,']

Then run it:


for d in dist:
    tmp = getSamp(s=d)

For uniform and binomial distributions:

For exponential and lognormal distributions:

How about a couple of different Poisson distributions?

That default Poisson yields a normal-looking distribution of means, but clearly the highly discrete nature of the source distribution has an effect on the shape of that distribution.

But overall, it seems to hold that the distribution of the means is approximately normal. (You might want to try other distributions for yourself…)
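If you want to go beyond eyeballing histograms, one option is to put a number on how normal the distribution of means looks, for example with scipy’s D’Agostino-Pearson test. Here is a sketch for the exponential case; note that with this many replicates the test can be sensitive enough to detect the small residual skew the CLT hasn’t yet washed out, so don’t be surprised by a modest p-value.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, reps = 1000, 10000

# One mean per replicate: reps means of samples of size n from an exponential
means = rng.exponential(size=(n, reps)).mean(axis=0)

# D'Agostino-Pearson normality test; larger p-values are consistent with normality
stat, p = stats.normaltest(means)
print(f'statistic={stat:.2f}, p-value={p:.4f}')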