My unit needs to periodically review the ‘people’ pages on our web site. We could do this page by page, but with 70-odd pages that's a drag, and it limits our ability to coordinate these checks.
So, I wrote a little Python using lxml to scrape all of our people and produce a tabular file that multiple staff can pop into Excel and edit. The QuickSites don’t provide for mass updates (that’s OK! they do lots of good stuff), but this gets us just a little more efficiency.
```python
from urllib.request import urlopen

import lxml.html as lh
from lxml.html import parse

# Column order matches the fields written for each person below.
header = ['lastnamenum', 'name', 'role', 'position', 'education', 'bio']

f = open('/Volumes/WYRK/iprscrape/affils.txt', 'w')
f.write('\t'.join(header) + '\n')

doc = parse('http://ipr.osu.edu/directory').getroot()
for link in doc.cssselect('a'):
    # Only follow links that point at individual people pages.
    if link.get('href') and link.text_content() and '/people' in link.get('href'):
        personurl = 'http://ipr.osu.edu' + link.get('href')
        html = urlopen(personurl).read()
        # The URL slug (lastname.n) doubles as a unique key for each person.
        lastnamenum = personurl.replace('http://ipr.osu.edu/people/', '', 1).strip()
        tree = lh.fromstring(html)
        # xpath() returns a list of text nodes; join each into a single string
        # so the output file gets plain text rather than Python list reprs.
        name = ' '.join(tree.xpath('//*[@id="title"]/text()')).strip()
        role = ' '.join(tree.xpath('//*[@id="bio_block"]/div[1]/div/div/text()')).strip()
        position = ''
        for div in tree.cssselect('div.field.field-type-text.field-field-ascpeople-position'):
            position = div.text_content().strip()
        bio = ' '.join(tree.xpath('//*[@id="bio_block"]/ul/li/text()')).strip()
        education = ' '.join(tree.xpath('//*[@id="leftcontent"]/div/div[2]/ul/li/text()')).strip()
        f.write('{0}\t{1}\t{2}\t{3}\t{4}\t{5}\n'.format(
            lastnamenum, name, role, position, education, bio))
f.close()
```
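One caveat with hand-formatted tab-separated output: if a bio or statement itself contains a tab or newline, the columns break when the file lands in Excel. A small sketch of a safer alternative, using the standard library's csv module with a tab delimiter plus a sanitizing pass (the rows and the relative file name here are hypothetical, just for illustration):

```python
import csv

# Hypothetical example rows; in the real script these would come from the scrape.
rows = [
    ['Name', 'position', 'education', 'lastnamenum', 'statement'],
    ['Jane Doe', 'Professor', 'PhD, Example U', 'doe.1', 'Bio text\twith a tab'],
]

with open('affils.txt', 'w', newline='') as f:
    writer = csv.writer(f, delimiter='\t', quoting=csv.QUOTE_MINIMAL)
    for row in rows:
        # Collapse embedded tabs/newlines so Excel keeps one record per line.
        writer.writerow([str(field).replace('\t', ' ').replace('\n', ' ')
                         for field in row])
```

The sanitizing step keeps every record on one line; csv.writer would otherwise quote fields containing the delimiter, which Excel handles but which is harder for staff to eyeball in a text editor.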