Connecting a huge PDF library with a bulk download

As I’ve mentioned before, FracFocus does not include the chemical records from 2011 to mid-2013 in its bulk download data, even though it still provides individual PDF files from that period. This is an important time in the fracking boom, and the data set would be considerably poorer without records from those years. Luckily, the environmental NGO SkyTruth scraped the PDFs from the FracFocus website until FracFocus started blocking web crawling. I use this SkyTruth archive to supplement the bulk download from FracFocus.

A lingering worry I’ve had, however, is that the disclosures from those years have been changed, with no record of the change. That is, SkyTruth’s archive was valid when they collected the data, but companies may have silently changed those PDFs since then. So I’ve begun a months-long project to connect the bulk data to the PDF library.

In the first stage, I am simply using the APINumbers from the SkyTruth archive to see if the FracFocus PDFs still exist. While that seems simple enough, the FracFocus search page for the PDFs makes the process pretty tedious. Not to mention, there are something like 35,000 separate SkyTruth disclosures to test.

The solution I’m using is a combination of the Python bindings for Selenium (a website testing tool that drives a real browser) and some scripts that slowly request the PDFs for the SkyTruth records. This is necessary because the search page does not allow traditional web scraping: it needs a live browser clicking on buttons to fetch the appropriate PDF files. Once the thing starts running, a Chrome browser pops up by itself, an APINumber is loaded into the appropriate field, a button is clicked, and a search is performed without any human intervention. Nice! My scripts keep the PDFs they find (for later comparison to the bulk data) and then move on to the next APINumber. I keep the speed low to avoid overloading the FracFocus servers (the original reason SkyTruth and others were locked out), so I get about 4 searches each minute.
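For readers who want to try something similar, a stripped-down sketch of that kind of loop might look like the following. It is not my actual script. The search URL, the element IDs and the "is there a PDF link?" test are all placeholders that would have to be read off the real search page, and the 15-second pause is just 60 seconds divided by 4 searches.

```python
import time
from typing import Callable, Dict, Iterable


def make_selenium_search(search_url: str, field_id: str, button_id: str) -> Callable[[str], bool]:
    """Build a search function backed by a live Chrome window.

    search_url, field_id and button_id are hypothetical placeholders,
    not known values from the FracFocus site.
    """
    # Imported here so the pacing loop below can be used without Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()  # opens a visible Chrome window

    def search(api_number: str) -> bool:
        driver.get(search_url)
        field = driver.find_element(By.ID, field_id)
        field.clear()
        field.send_keys(api_number)
        driver.find_element(By.ID, button_id).click()
        # Assumed success test: at least one link ending in .pdf is present.
        # A real page may load results asynchronously and need an explicit wait.
        return len(driver.find_elements(By.CSS_SELECTOR, "a[href$='.pdf']")) > 0

    return search


def check_disclosures(
    api_numbers: Iterable[str],
    search: Callable[[str], bool],
    delay_seconds: float = 15.0,  # 60 s / 4 searches per minute
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, bool]:
    """Run one search per APINumber, pausing between requests."""
    results: Dict[str, bool] = {}
    for i, api_number in enumerate(api_numbers):
        if i > 0:
            sleep(delay_seconds)  # stay gentle on the server
        results[api_number] = search(api_number)
    return results
```

Keeping the pacing loop separate from the browser code means the loop can be tried out with a fake search function before pointing it at a real site. At 4 searches a minute, something like 35,000 disclosures works out to roughly 8,750 minutes, or about six days of continuous running.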

Today I finished working through the SkyTruth PDF searches and indeed found that 140 disclosures have been removed from the FracFocus system since SkyTruth did their work. Why were they deleted? Maybe simple mistakes? Seems worth looking into.

The next phase will be to compare the SkyTruth data with the actual data in the current PDFs. That’s a big project, but I think the job is easier now than when SkyTruth was grappling with it. I’m looking forward to trying out the Camelot project, a simple PDF table reader.
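I haven’t written this part yet, but a first guess at the shape of the comparison is sketched below. It assumes Camelot can pull the chemical tables out of these particular PDFs, which I have not tested, and that both sources can be reduced to plain rows of text. The helper names are made up for illustration.

```python
from typing import Iterable, List, Sequence, Set, Tuple

Row = Tuple[str, ...]


def read_pdf_rows(pdf_path: str, pages: str = "1") -> List[List[str]]:
    """Collect every table row Camelot finds on the given pages."""
    # Imported here so the comparison below works without Camelot installed.
    import camelot

    rows: List[List[str]] = []
    for table in camelot.read_pdf(pdf_path, pages=pages):
        rows.extend(table.df.values.tolist())  # table.df is a pandas DataFrame
    return rows


def normalize(row: Sequence[object]) -> Row:
    """Lower-case each cell and collapse whitespace so trivial differences vanish."""
    return tuple(" ".join(str(cell).split()).lower() for cell in row)


def diff_rows(
    archived: Iterable[Sequence[object]],
    current: Iterable[Sequence[object]],
) -> Tuple[Set[Row], Set[Row]]:
    """Return (rows only in the archive, rows only in the current PDF)."""
    old = {normalize(row) for row in archived}
    new = {normalize(row) for row in current}
    return old - new, new - old
```

Two empty sets would mean the archived record and the current PDF agree, at least after normalizing. Anything left over is a candidate for a silent change, though differences in how each source was parsed could also show up here.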