Recently I was approached to help with a scraping project. As I love scraping and formatting data, I excitedly agreed. I also thought it’d be a good way to get experience using Selenium.
The goal: scrape information from the Ohio Department of Education’s Reports Portal (https://reports.education.ohio.gov/finance/legacy-report).
Let’s consider the 2010 PASS reports. If you click on the link above, it will take you to the legacy report webpage for the Ohio Department of Education. From there, you can select an “LEA Type” (e.g., Traditional School District) and a “Fiscal Year” (2010).
Then you press the “Go” button.
From there, you need to fill in “Payment” with “Final #5” and, finally, your “LEA,” where the drop-down lets you select any school district you’d like, by district name or unique district ID.
Once you select a district, let’s say, Ada Exempted Village (Hardin) – 045187, you then click “Process the Report.”
It will look like this:
Once you click “Process the Report,” this page loads:
Now we want to scrape some data! Specifically, further down the page we’ll see our variables of interest, i.e., rows 1A Special Ed. Cat. 1, 1B Special Ed. Cat. 2, …, 1F Special Ed. Cat. 6, for two different column values: KDG (kindergarten) and Total.
As a human, we see this information and it’s easy to get the data by copying and pasting; however, doing so for all 615 school districts would be time-consuming and tedious: enter Selenium! Note: all of this code is available on my GitHub.
Let’s first import our necessary packages:
from selenium import webdriver
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
import time
From here we can set the driver path, i.e., we want to open a Chrome browser! Then we tell Selenium to wait briefly so pages have time to load (the final line of the snippet below).
# set the chromedriver.exe path and open a Chrome browser
# (alternatively, Service(ChromeDriverManager().install()) fetches a driver automatically)
driver = webdriver.Chrome(service=Service(r"C:\chromedriver.exe"))
driver.implicitly_wait(0.5)
Now we want to load the page of interest. We use driver.get(url) to load the URL, where driver is the Chrome instance we defined in the snippet above.
url = "https://reports.education.ohio.gov/finance/legacy-report" driver.get(url)
Now we want to fill in the text boxes with the information necessary to generate a report:
# fill in LEA type
text_box_path = '/html/body/app-root/eas-ui-lib-spinner/div/eas-core-application-container/div/div[2]/as-split/as-split-area[2]/div/div/app-legacy-reports/lib-legacy-report/div/div/div/div[2]/ul/li[1]/eas-ui-lib-form-group/label/ng-select/div/div/div[3]/input'
text_box = driver.find_element(By.XPATH, text_box_path)
text_box.send_keys("Traditional School District")
Now, you might be thinking: how on Earth did you get that text_box_path? Great question. First, right-click on the text box and click “Inspect.”
The developer tools panel will pop up. After it appears, right-click the highlighted element, hover over “Copy,” and then click “Copy XPath.”
That’s how you get the location! You can then do the same thing to fill in the year (a quick sketch of that follows the next snippet), then press “Go,” i.e., you find the XPath for the Go button and then click it!
go_path = "/html/body/app-root/eas-ui-lib-spinner/div/eas-core-application-container/div/div[2]/as-split/as-split-area[2]/div/div/app-legacy-reports/lib-legacy-report/div/div/div/div[2]/ul/li[1]/eas-ui-lib-form-group/label/ng-select/ng-dropdown-panel/div/div[2]/div" drop_box_1 = driver.find_element(By.XPATH, go_path) time.sleep(1) drop_box_1.click() #click go
Now another drop-down will pop up; fill that in and, finally, fill in your LEA. In the code this is done within a loop, which iterates through all possible options of the drop-down and then saves all the data you need! A sketch of that loop structure is below.
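Here’s a minimal sketch of that structure, assuming the drop-down options share a common locator. The XPaths below are placeholders (not the real ones from the page), and the scraping-and-storing step is elided:

# placeholder XPaths: copy the real ones via Inspect
lea_box_path = '/html/body/...'                            # the LEA text box
option_path = '//ng-dropdown-panel//div[@role="option"]'   # hypothetical option locator
process_path = '/html/body/...'                            # the "Process the Report" button

# open the drop-down once to count the districts
driver.find_element(By.XPATH, lea_box_path).click()
time.sleep(1)
n_districts = len(driver.find_elements(By.XPATH, option_path))

for i in range(n_districts):
    # re-open the drop-down and pick the i-th district
    driver.find_element(By.XPATH, lea_box_path).click()
    time.sleep(1)
    driver.find_elements(By.XPATH, option_path)[i].click()
    # generate the report, then scrape and store the data (see below)
    driver.find_element(By.XPATH, process_path).click()
    time.sleep(2)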
Now let’s think about how to save the data we need. You can grab the full text of the page using:
web_text = driver.page_source
And from there, you’ll just have to be clever about how to split and strip your data! Honestly, figuring this out is the fun part!
# grab the two numbers that follow "Special Education Teachers" on the report page
sped_teachers_calc_i = web_text.split("Ohio Department")[1].split('\n Special Education Teachers')[1].splitlines()[0].strip().split()[0]
sped_teachers_funded_i = web_text.split("Ohio Department")[1].split('\n Special Education Teachers')[1].splitlines()[0].strip().split()[1]
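The same split-and-strip pattern extends to the special education category rows. For example, something along these lines (the exact label string here is an assumption on my part; check your own web_text for the real formatting):

# assumption: the row label appears verbatim in the page text
sped_cat_1_kdg = web_text.split('1A Special Ed. Cat. 1')[1].splitlines()[0].strip().split()[0]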
And then, all you have to do is store your data. I did so in a DataFrame: each loop iteration created a row of data corresponding to that district, which I then appended to the end of the DataFrame at the end of the iteration.
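For concreteness, here’s a sketch of that bookkeeping with pandas (the column names and the district_name variable are mine, not from the original code):

import pandas as pd

# empty DataFrame to collect one row per district
results = pd.DataFrame(columns=["district", "sped_teachers_calc", "sped_teachers_funded"])

# inside the loop: build a one-row DataFrame and append it to the end
row = pd.DataFrame([{
    "district": district_name,  # hypothetical: whichever district this iteration selected
    "sped_teachers_calc": sped_teachers_calc_i,
    "sped_teachers_funded": sped_teachers_funded_i,
}])
results = pd.concat([results, row], ignore_index=True)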
And then, run the code!! Below is what you’ll see on your screen when you run it: it will loop through and fill in the values for you, and then save the results to a DataFrame! The video below shows the first two iterations of the loop; as I’m sure you can tell, the code gets all the values much faster (and more accurately) than you or I could :)
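If you also want the results on disk once the loop finishes, a one-line save does it (the filename is just an example):

results.to_csv("ohio_pass_2010.csv", index=False)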
There are other examples on my GitHub for different variables. Different years had different formats, so each format required its own slightly tweaked version of the code; they are all on GitHub for your perusal!
Good luck scraping!!