Recently I was approached to help with a scraping project. As I love scraping and formatting data, I excitedly agreed. I also thought it’d be a good way to get experience using Selenium.
The goal: scrape information from the Ohio Department of Education’s Reports Portal (https://reports.education.ohio.gov/finance/legacy-report).
Let’s consider the 2010 PASS reports. If you click on the link above, it will take you to the legacy report webpage for the Ohio Department of Education. From there, you can select an “LEA Type” (e.g., Traditional School District) and a “Fiscal Year” (2010).
Then you press the “Go” button.
From there, you need to fill in “Payment” with “Final #5” and, finally, your “LEA,” where the drop-down lets you select any school district you’d like, by district name or unique district ID.
Once you select a district, let’s say, Ada Exempted Village (Hardin) – 045187, you then click “Process the Report.”
It will look like this:
Once you click “Process the Report,” this page loads:
Now we want to scrape some data! Specifically, further down the page we’ll see our variables of interest, i.e., rows 1A Special Ed. Cat. 1, 1B Special Ed. Cat. 2, …, 1F Special Ed. Cat. 6, for two different column values: KDG (kindergarten) and Total.
As a human, we see this information and it’s easy to get the data by copying and pasting; however, doing so for all 615 school districts would be time-consuming and tedious: enter Selenium! Note: all of this code is available on my GitHub.
Let’s first import our necessary packages:
from selenium import webdriver
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
import time
From here we can set the driver path, i.e., we want to open a Chrome browser! Then we tell Selenium to wait briefly so pages have time to load (the final line of the snippet below).
# set the chromedriver.exe path and open a Chrome browser
# (alternatively, Service(ChromeDriverManager().install()) fetches a driver automatically)
driver = webdriver.Chrome(service=Service(r"C:\chromedriver.exe"))
driver.implicitly_wait(0.5)
Now we want to load the page of interest. We use driver.get(url) to load the URL, where driver is the Chrome instance we defined in the snippet above.
url = "https://reports.education.ohio.gov/finance/legacy-report" driver.get(url)
Now we want to fill in the text boxes with the information necessary to generate a report:
# fill in LEA type
text_box_path = '/html/body/app-root/eas-ui-lib-spinner/div/eas-core-application-container/div/div[2]/as-split/as-split-area[2]/div/div/app-legacy-reports/lib-legacy-report/div/div/div/div[2]/ul/li[1]/eas-ui-lib-form-group/label/ng-select/div/div/div[3]/input'
text_box = driver.find_element(By.XPATH, text_box_path)
text_box.send_keys("Traditional School District")
Now, you might be thinking: how on Earth did you get that text_box_path? Great question. First, right-click on the text box and click “Inspect.”
The developer tools panel will pop up. After it appears, right-click the highlighted element, hover over “Copy,” and then click “Copy XPath.”
That’s how you get the location! You can then do the same thing to fill in the year (a quick sketch of that follows the next snippet), then press “Go,” i.e., you find the XPath for the Go button and then click it!
go_path = "/html/body/app-root/eas-ui-lib-spinner/div/eas-core-application-container/div/div[2]/as-split/as-split-area[2]/div/div/app-legacy-reports/lib-legacy-report/div/div/div/div[2]/ul/li[1]/eas-ui-lib-form-group/label/ng-select/ng-dropdown-panel/div/div[2]/div" drop_box_1 = driver.find_element(By.XPATH, go_path) time.sleep(1) drop_box_1.click() #click go
Now another drop-down will pop up; fill that in and, finally, fill in your LEA. In the code this is done within a loop, which iterates through all possible options of the drop-down and then saves all the data you need! A sketch of that loop structure is below.
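Here’s a minimal sketch of that structure, assuming the drop-down options share a common locator. The XPaths below are placeholders (not the real ones from the page), and the scraping-and-storing step is elided:

# placeholder XPaths: copy the real ones via Inspect
lea_box_path = '/html/body/...'                            # the LEA text box
option_path = '//ng-dropdown-panel//div[@role="option"]'   # hypothetical option locator
process_path = '/html/body/...'                            # the "Process the Report" button

# open the drop-down once to count the districts
driver.find_element(By.XPATH, lea_box_path).click()
time.sleep(1)
n_districts = len(driver.find_elements(By.XPATH, option_path))

for i in range(n_districts):
    # re-open the drop-down and pick the i-th district
    driver.find_element(By.XPATH, lea_box_path).click()
    time.sleep(1)
    driver.find_elements(By.XPATH, option_path)[i].click()
    # generate the report, then scrape and store the data (see below)
    driver.find_element(By.XPATH, process_path).click()
    time.sleep(2)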
Now let’s think about how to save the data we need. You can grab the full text of the page using:
web_text = driver.page_source
And from there, you’ll just have to be clever about how to split and strip your data! Honestly, figuring this out is the fun part!
# grab the two numbers that follow "Special Education Teachers" on the report page
sped_teachers_calc_i = web_text.split("Ohio Department")[1].split('\n Special Education Teachers')[1].splitlines()[0].strip().split()[0]
sped_teachers_funded_i = web_text.split("Ohio Department")[1].split('\n Special Education Teachers')[1].splitlines()[0].strip().split()[1]
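The same split-and-strip pattern extends to the special education category rows. For example, something along these lines (the exact label string here is an assumption on my part; check your own web_text for the real formatting):

# assumption: the row label appears verbatim in the page text
sped_cat_1_kdg = web_text.split('1A Special Ed. Cat. 1')[1].splitlines()[0].strip().split()[0]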
And then, all you have to do is store your data. I did so in a DataFrame: each loop iteration created a row of data corresponding to that district, which I then appended to the end of the DataFrame at the end of the iteration.
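For concreteness, here’s a sketch of that bookkeeping with pandas (the column names and the district_name variable are mine, not from the original code):

import pandas as pd

# empty DataFrame to collect one row per district
results = pd.DataFrame(columns=["district", "sped_teachers_calc", "sped_teachers_funded"])

# inside the loop: build a one-row DataFrame and append it to the end
row = pd.DataFrame([{
    "district": district_name,  # hypothetical: whichever district this iteration selected
    "sped_teachers_calc": sped_teachers_calc_i,
    "sped_teachers_funded": sped_teachers_funded_i,
}])
results = pd.concat([results, row], ignore_index=True)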
And then, run the code!! Below is what you’ll see on your screen when you run it: it will loop through and fill in the values for you, and then save the results to a DataFrame! The video below shows the first two iterations of the loop; as I’m sure you can tell, the code gets all the values much faster (and more accurately) than you or I could :)
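If you also want the results on disk once the loop finishes, a one-line save does it (the filename is just an example):

results.to_csv("ohio_pass_2010.csv", index=False)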
There are other examples on my GitHub for different variables. Different years had different formats, so each format required its own slightly tweaked version of the code; they are all on GitHub for your perusal!
Good luck scraping!!