Scraping

My unit needs to periodically review the ‘people’ pages on our web site. We could do this page by page, but with 70-odd pages that is a drag, and it limits our ability to coordinate these checks.

So, I wrote a little Python using lxml to scrape all of our people and produce a tabular file that multiple staff can pop into Excel and edit. The QuickSites don’t provide for mass updates (that’s OK! they do lots of good stuff), but this gets us just a little more efficiency.

# Python 3; needs the lxml and cssselect packages
import csv
from urllib.request import urlopen
import lxml.html as lh

BASE = 'http://ipr.osu.edu'
OUTPUT_FILE = '/Volumes/WYRK/iprscrape/affils.txt'
# Header row, in the same order as the columns written below
HEADER = ['lastnamenum', 'name', 'role', 'position', 'education', 'bio']

def joined(nodes):
	# xpath() returns a list of text nodes; collapse it into one cell
	return '; '.join(n.strip() for n in nodes if n.strip())

# The directory page links to every person page
doc = lh.parse(urlopen(BASE + '/directory')).getroot()
with open(OUTPUT_FILE, 'w', newline='', encoding='utf-8') as f:
	writer = csv.writer(f, delimiter='\t')
	writer.writerow(HEADER)
	for link in doc.cssselect('a'):
		href = link.get('href')
		if href and link.text_content() and '/people' in href:
			personurl = BASE + href
			# The URL slug after /people/ identifies the person
			lastnamenum = personurl.replace(BASE + '/people/', '', 1).strip()
			tree = lh.fromstring(urlopen(personurl).read())
			name = joined(tree.xpath('//*[@id="title"]/text()'))
			role = joined(tree.xpath('//*[@id="bio_block"]/div[1]/div/div/text()'))
			position = ''
			for div in tree.cssselect('div.field.field-type-text.field-field-ascpeople-position'):
				position = div.text_content().strip()
			bio = joined(tree.xpath('//*[@id="bio_block"]/ul/li/text()'))
			education = joined(tree.xpath('//*[@id="leftcontent"]/div/div[2]/ul/li/text()'))
			writer.writerow([lastnamenum, name, role, position, education, bio])
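
A tab-delimited file is easy to open in Excel, but a scraped bio can itself contain a tab or a line break, which would split one person across several cells or rows. Here is a minimal sketch, using made-up rows rather than real scraped data, of how Python's csv module quotes such fields so that they survive a round trip. The ids and names below are hypothetical.

```python
import csv
import io

# Made-up rows standing in for scraped data; the second bio contains a tab
# and a line break, which would corrupt a hand-built tab-separated line.
rows = [
    ['lastnamenum', 'name', 'bio'],
    ['doe.1', 'Jane Doe', 'Studies migration.'],
    ['roe.2', 'Richard Roe', 'Line one\nLine two\twith a tab'],
]

# Write to an in-memory buffer instead of a real file for this sketch.
buffer = io.StringIO()
csv.writer(buffer, delimiter='\t').writerows(rows)

# Read the text back the way a later script would.
buffer.seek(0)
round_tripped = list(csv.reader(buffer, delimiter='\t'))
print(len(round_tripped), round_tripped == rows)
```

The awkward field is wrapped in quotes on the way out, so it comes back as a single cell when the file is read again.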

Scrape personnel data from ASC QuickSites