
Dictionaries and keys

finishing up Alice and word count (exercise 7):
Screen Shot 2014-05-14 at 10.20.05 AM

from the Python tutorial:

all of these expressions build the same dictionary, so the comparison on the last line evaluates to True

a = dict(one=1, two=2, three=3)
b = {'one': 1, 'two': 2, 'three': 3}
c = dict(zip(['one', 'two', 'three'], [1, 2, 3]))
d = dict([('two', 2), ('one', 1), ('three', 3)])
e = dict({'three': 3, 'one': 1, 'two': 2})
a == b == c == d == e

creating our dict:
Screen Shot 2014-05-14 at 11.19.52 AM

adding to it:
Screen Shot 2014-05-14 at 11.21.18 AM
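
Since the screenshots don't show the code itself, here is a minimal sketch of creating a dict and adding to it for a word count. The sample sentence and the name `counts` are my own stand-ins (not the Alice file or the variable names used in class):

```python
# Hypothetical sample text standing in for the Alice file.
text = "the cat saw the dog and the dog saw the cat"

counts = {}                      # creating our dict (empty to start)

for word in text.split():
    if word in counts:
        counts[word] += 1        # key already present: bump its value
    else:
        counts[word] = 1         # new key: add it with a count of 1

print(counts["the"])             # look up one key
print(sorted(counts))            # all the keys, alphabetically
```

The keys are the words and the values are how many times each word was seen, so looking up a word gives its count directly.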

from the tutorial:
Screen Shot 2014-05-14 at 11.49.37 AM

Homework 1




tasks in the homework:

We want to address the question of whether women talk more than men. To answer this, we will use the Fisher corpus on Carmen.
Write a python script that outputs —

  • the raw total number of words spoken by women
  • the raw total number of words spoken by men
  • the total number of utterances spoken by women
  • the total number of utterances spoken by men
  • the average number of words per utterance spoken by women and by men
  • the number of female speakers
  • the number of male speakers

the previous assignment plus opening the Fisher files with nested ‘for’ loops gives:
HW1_output

which is the output of this loop (so far) written in class:

# ----- importing modules ----- #
import os

# ----- initialization of tracking variables ----- #
totalWordsSpoken = 0
totalUtterances = 0

wordsW = 0
wordsM = 0
wordsN = 0

utterW = 0
utterM = 0
utterN = 0

# ----- listing top-level directory ----- #
dir = "Fisher"
dirA = os.listdir(dir)

# ----- listing subdirectories ----- #
for dirB in dirA:
    dirC = dir + "/" + dirB
    fileA = os.listdir(dirC)
    
    # ----- opening files ----- #
    for fileB in fileA :
        path = dirC + "/" + fileB
        fileC = open(path)

        # -----	for loop ----- #
        for sentence in fileC:
            
            # -----	processing block ----- #
            words = sentence.split()
            onlywords = words[3:]
            genderLetter = words[2][2]
            speakerID = words[2][0]
            numberwords = len(onlywords)
            onlysen = " ".join(onlywords)
            
            # -----	count ----- #
            totalWordsSpoken += numberwords
            totalUtterances += 1
            
            # -----	gender output ----- #
            if genderLetter == 'f':
                gender = "woman"
                wordsW += numberwords
                utterW += 1
            elif genderLetter == 'm':
                gender = "man"
                wordsM += numberwords
                utterM += 1
            else:
                gender = "non-gendered person"
                wordsN += numberwords
                utterN += 1

            # -----	output per sentence----- #
            print()
            print("  ", "sentence number", totalUtterances, ":", onlysen)
            print("  ", "number of words:", numberwords)
            print("  ", "words are:", onlywords)
            print("  ", "speaker ID:", speakerID)
            print("  ", "speaker is a:", gender)

# -----	final totals output ----- #
print()
print("  ", "total number of words:", totalWordsSpoken)
print("  ", "total number of utterances:", totalUtterances)
print("  ", "the average number of words per utterance was :",
      totalWordsSpoken / totalUtterances)
print()
print("  ", "total words spoken by women:", wordsW)
print("  ", "total number of utterances:", utterW)
print("  ", "the average number of words per utterance was :",
      wordsW / utterW)
print()
print("  ", "total words spoken by men:", wordsM)
print("  ", "total number of utterances:", utterM)
print("  ", "the average number of words per utterance was :",
      wordsM / utterM)
print()
print("  ", "total words unaccounted for by gender:", wordsN)
print("  ", "total number of utterances unaccounted for:", utterN)
print()

still need to add code to count the number of speakers of each gender.

having a bit of trouble with the if/else statements and the nesting here. I’m going to build the code for only one file, extend it to two, and then run it on the entire corpus once I can tell the ‘A-f:’ in one file apart from the ‘A-f:’ in the next file in the list. I’m also assuming that no speaker is ever repeated across files.
Here are the counts for fe_03_06500.txt to compare the code against
(there is 1 female speaker and 1 male speaker):
grep_speakerID_06500

here is my speaker count abstraction (I’m sure there’s an easier way to do this):

# --- babyHW1.py --- #

# --- instantiate --- #
a = 13 	#w
b = 0	#w
c = 2	#m
d = 9	#m
e = 0	#n

w = 0
m = 0
n = 0

# --- speaker count --- #
# a speaker counts once if they said anything at all; a zero count
# leaves the tally alone (resetting it to 0 would erase earlier speakers)
if a > 0:
    w += 1

if b > 0:
    w += 1

if c > 0:
    m += 1

if d > 0:
    m += 1

if e > 0:
    n += 1

# --- output --- #
print("w:", w)
print("m:", m)
print("n:", n)
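
One possibly easier way to count speakers, as a rough sketch: remember every (file, speaker letter, gender letter) combination in a set, so repeats within a file only count once. This assumes each line looks like "start end A-f: words ..." as in the scripts above, and that no speaker is repeated across files. The function name `count_speakers` and the sample lines are made up for illustration:

```python
def count_speakers(files):
    """files maps a file name to its list of transcript lines (hypothetical input shape)."""
    seen = set()
    for name, lines in files.items():
        for line in lines:
            fields = line.split()
            if len(fields) < 3 or len(fields[2]) < 3:
                continue                        # skip blank or malformed lines
            tag = fields[2]                     # e.g. "A-f:"
            seen.add((name, tag[0], tag[2]))    # a set ignores duplicates

    women = men = other = 0
    for _, _, gender in seen:
        if gender == "f":
            women += 1
        elif gender == "m":
            men += 1
        else:
            other += 1
    return women, men, other


# Made-up sample: two files, with "A-f:" appearing in both.
sample = {
    "file1.txt": ["0.5 1.2 A-f: hello there",
                  "1.3 2.0 B-m: hi",
                  "2.1 3.0 A-f: how are you"],
    "file2.txt": ["0.1 0.9 A-f: good morning",
                  "1.0 1.8 B-f: morning"],
}
print(count_speakers(sample))
```

Including the file name in the tuple is what keeps the ‘A-f:’ in one file separate from the ‘A-f:’ in the next.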

A and B loops

import os

countX = 0
countA = 0
countB = 0

dirX = "Fisher"
listA = os.listdir(dirX)
countX += 1

for itemB in listA:
    pathC = dirX + "/" + itemB
    listD = os.listdir(pathC)
    countA += 1
    
    for itemE in listD:
        pathF = pathC + "/" + itemE
        itemG = open(pathF)
        countB += 1

        for fileH in itemG:
            outputI = len(fileH)


print()
print(dirX)
print(itemB)
print(itemE)
print(itemG)
print(outputI)

print()
print(" lines/items in --")
print("   --       Fisher =", countX)
print("   --    Fisher/.. =", countA)
print("   -- Fisher/../.. =", countB)
print()
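
The same directory / subdirectory / file traversal can probably also be written with `os.walk`, which visits every folder under a starting point. A small sketch, run on a throwaway directory tree rather than the real Fisher corpus (the folder and file names below are invented, and `count_files` is my own helper name):

```python
import os
import tempfile


def count_files(top):
    """Count every file found anywhere under the directory `top`."""
    total = 0
    for dirpath, dirnames, filenames in os.walk(top):
        total += len(filenames)      # files sitting directly in dirpath
    return total


# Build a tiny stand-in tree: two subdirectories with two files each.
top = tempfile.mkdtemp()
for sub in ("065", "066"):
    os.makedirs(os.path.join(top, sub))
    for name in ("a.txt", "b.txt"):
        with open(os.path.join(top, sub, name), "w") as handle:
            handle.write("0.0 1.0 A-f: hi\n")

print(count_files(top))
```

`os.path.join` builds the path with the right separator, which does the same job as `dirX + "/" + itemB` above.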

Python IDs



[pdf]

module.function("argument")

here,
module = os
function = listdir
argument = path

so…

import os
os.listdir("Fisher")   
# This is a relative path: it is resolved relative to where I am,
# i.e. the directory I start in (the current working directory)
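
To see what a relative path turns into, here is a small sketch using `os.getcwd()` and `os.path.abspath()`. It does not need a Fisher folder to exist, since it only builds path strings:

```python
import os

relative = "Fisher"                      # relative: depends on where I start
absolute = os.path.abspath(relative)     # the same path, spelled out in full

print("I am in:      ", os.getcwd())
print("relative path:", relative)
print("resolves to:  ", absolute)
```

Running the same script from a different directory changes the last two printed paths' meaning, which is why `os.listdir("Fisher")` only works when started from the folder that contains Fisher.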

Exercise 4




In class, we have seen how to read from a file. Here is what the code looks like so far:

#------------initialization of tracking variables-------------------
totalWordsSpoken = 0
totalUtterances = 0

#------------open the file-------------------------

fisherFile = open("Fisher/065/fe_03_06500.txt")

#-----------processing block---------------------

for line in fisherFile:
        #list of the items in the line
        words = line.split()

        print("Here are all the words", words)

        #extracting speaker ID
        speaker = words[2]

        #actual words uttered by the speaker
        actualWords = words[3:]

        print("the sentence is spoken by", speaker)
        print("their actual utterance was", actualWords)
        print("the sentence has", len(actualWords), "words")

        totalWordsSpoken += len(actualWords)
        totalUtterances += 1

#---------done with all the sentences; post-analysis----------

print("the total number of words spoken was", totalWordsSpoken)
print("the total number of utterances was", totalUtterances)
print("the average number of words per utterance was",
      totalWordsSpoken / totalUtterances)

Now we want to keep track of the gender information too. We want the total number of words and the total number of utterances spoken by women, as well as the same two totals for men. Look at the notes and adapt your code to use an “if statement” to do so. Make sure your code runs ;-) The notes give the results.

wrote this code in class (see Python scripting)

# ----- initialization of tracking variables ----- #
totalWordsSpoken = 0
totalUtterances = 0

wordsW = 0
wordsM = 0
wordsN = 0

utterW = 0
utterM = 0
utterN = 0

# ----- opening file ----- #

fF = open("../Downloads/Fisher/065/fe_03_06500.txt")


# -----	for loop ----- #
for sentence in fF:
	
    # -----	processing block ----- #
    words = sentence.split()
    onlywords = words[3:]
    genderLetter = words[2][2]
    speakerID = words[2][0]
    numberwords = len(onlywords)
    onlysen = " ".join(onlywords)
	
    # -----	count ----- #
    totalWordsSpoken += numberwords
    totalUtterances += 1

    # -----	gender output ----- #
    if genderLetter == 'f':
        gender = "woman"
        wordsW += numberwords
        utterW += 1
    elif genderLetter == 'm':
        gender = "man"
        wordsM += numberwords
        utterM += 1
    else:
        gender = "non-gendered person"
        wordsN += numberwords
        utterN += 1

    # -----	output per sentence----- #
    print()
    print("  ", "sentence number", totalUtterances, ":", onlysen)
    print("  ", "number of words:", numberwords)
    print("  ", "words are:", onlywords)
    print("  ", "speaker ID:", speakerID)
    print("  ", "speaker is a:", gender)

# -----	final totals output ----- #

print()
print("  ", "total number of words:", totalWordsSpoken)
print("  ", "total number of utterances:", totalUtterances)
print("  ", "the average number of words per utterance was :",
      totalWordsSpoken / totalUtterances)
print()
print("  ", "total words spoken by women:", wordsW)
print("  ", "total number of utterances:", utterW)
print("  ", "the average number of words per utterance was :",
      wordsW / utterW)
print()
print("  ", "total words spoken by men:", wordsM)
print("  ", "total number of utterances:", utterM)
print("  ", "the average number of words per utterance was :",
      wordsM / utterM)
print()
print("  ", "total words unaccounted for by gender:", wordsN)
print("  ", "total number of utterances unaccounted for:", utterN)
print()

which returns this output on the macs at school (haven’t tried it at home yet)
Screen Shot 2014-05-08 at 11.54.15 AM

Python scripting



open interactive Python

>>> fisherFile = open("Fisher/065/fe_03_06500.txt")

Creates a file object

“tail” in Unix prints the last ten lines of a file by default
(use as such):

tail "Fisher/065/fe_03_06500.txt"

this code to count the number of words by gender:

# ----- initialization of tracking variables ----- #
totalWordsSpoken = 0
totalUtterances = 0

wordsW = 0
wordsM = 0
wordsN = 0

utterW = 0
utterM = 0
utterN = 0

# ----- opening file ----- #

fF = open("../Downloads/Fisher/065/fe_03_06500.txt")


# -----	for loop ----- #
for sentence in fF:
	
    # -----	processing block ----- #
    words = sentence.split()
    onlywords = words[3:]
    genderLetter = words[2][2]
    speakerID = words[2][0]
    numberwords = len(onlywords)
    onlysen = " ".join(onlywords)
	
    # -----	count ----- #
    totalWordsSpoken += numberwords
    totalUtterances += 1

    # -----	gender output ----- #
    if genderLetter == 'f':
        gender = "woman"
        wordsW += numberwords
        utterW += 1
    elif genderLetter == 'm':
        gender = "man"
        wordsM += numberwords
        utterM += 1
    else:
        gender = "non-gendered person"
        wordsN += numberwords
        utterN += 1

    # -----	output per sentence----- #
    print()
    print("  ", "sentence number", totalUtterances, ":", onlysen)
    print("  ", "number of words:", numberwords)
    print("  ", "words are:", onlywords)
    print("  ", "speaker ID:", speakerID)
    print("  ", "speaker is a:", gender)

# -----	final totals output ----- #

print()
print("  ", "total number of words:", totalWordsSpoken)
print("  ", "total number of utterances:", totalUtterances)
print("  ", "the average number of words per utterance was :",
      totalWordsSpoken / totalUtterances)
print()
print("  ", "total words spoken by women:", wordsW)
print("  ", "total number of utterances:", utterW)
print("  ", "the average number of words per utterance was :",
      wordsW / utterW)
print()
print("  ", "total words spoken by men:", wordsM)
print("  ", "total number of utterances:", utterM)
print("  ", "the average number of words per utterance was :",
      wordsM / utterM)
print()
print("  ", "total words unaccounted for by gender:", wordsN)
print("  ", "total number of utterances unaccounted for:", utterN)
print()

returns this result:
Screen Shot 2014-05-08 at 11.54.15 AM

how do we check that our output is correct? Count them manually?
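
One option, sketched here under my own assumptions: run the same counting logic on a few made-up lines that are short enough to count by hand, and compare. The function name `count_by_gender` and the sample lines are invented; the line format ("start end A-f: words ...") follows the script above.

```python
def count_by_gender(lines):
    """Return (words, utterances) dicts keyed by gender letter 'f' / 'm'."""
    words = {"f": 0, "m": 0}
    utterances = {"f": 0, "m": 0}
    for line in lines:
        fields = line.split()
        gender = fields[2][2]                  # third character of e.g. "A-f:"
        if gender in words:
            words[gender] += len(fields[3:])   # everything after the speaker tag
            utterances[gender] += 1
    return words, utterances


# Tiny sample, counted by hand in the comments.
sample = [
    "0.00 1.10 A-f: hello how are you",   # woman, 4 words
    "1.20 2.00 B-m: fine thanks",         # man, 2 words
    "2.10 3.50 A-f: good",                # woman, 1 word
]

words, utterances = count_by_gender(sample)
assert words == {"f": 5, "m": 2}
assert utterances == {"f": 2, "m": 1}
print("hand count matches")
```

If the small case agrees with the hand count, the totals on the full file are at least produced by logic that has been checked once.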

Exercise 3




In class, we have started writing a script that processes two hard-coded sentences and keeps a running total of the number of words actually uttered by the speakers. We did this in a dumb way: copy-pasting the processing block. Now simplify the script using a “for loop”. The idea should be to replace the two copies of the processing block with a single copy inside a loop. To do this, you’ll need to create a list containing the two sentences, write a “for loop” operating over this list, and put the processing block inside that loop. You can of course have more than 2 sentences in the list ;-)

Here is what the code looks like so far (also posted on the website):

# initialization
# keep track of total of words
totalWords = 0


# processing of sentence 1
sentence = "B-f: I'm in graduate school"

# get the words of the sentence
words = sentence.split()

print("Words of sentence 1:", words)

# extract speaker ID
speaker = words[0]

# extract words uttered (everything except first element in the list)
actualWords = words[1:]

# number of words uttered
numberAWords = len(actualWords)

# increment the total
totalWords = totalWords + numberAWords

print("speaker is: ", speaker)
print("words are", actualWords)
print("number of words:", numberAWords)
print("total so far:", totalWords)

# processing of sentence 2
sentence = "A-f: at OSU?"

# get the words of the sentence
words = sentence.split()

print("Words of sentence 2:", words)

# extract speaker ID
speaker = words[0]

# extract words uttered (everything except first element in the list)
actualWords = words[1:]

# number of words uttered
numberAWords = len(actualWords)

# increment the total
totalWords = totalWords + numberAWords

print("speaker is: ", speaker)
print("words are", actualWords)
print("number of words:", numberAWords)
print("total so far:", totalWords)

# post-analysis
print("total words uttered:", totalWords)

my code:


# -- initialization of tracking variables
totalWordsSpoken = 0
totalUtterances = 0

# -----	variables in list ----- #

s1 = "B-f: I'm in graduate school"
s2 = "A-m: at OSU?"
s3 = "B-f: um, yeah. Here at OSU"
s4 = "A-m: cool, me too."
s5 = "C-?: me want cookies!! nom nom nom"

senList = [s1, s2, s3, s4, s5]

# -----	for loop ----- #
for sentence in senList:
	
	# -----	processing block ----- #
	words = sentence.split()
	onlywords = words[1:]
	genderLetter = words[0][2]
	speakerID = words[0][0]
	numberwords = len(onlywords)
	
	# -----	count ----- #
	totalWordsSpoken += len(onlywords)
	totalUtterances += 1
	
	# -----	gender output ----- #
	if genderLetter == 'f':
		gender = "woman"
	elif genderLetter == 'm':
		gender = "man"
	else:
		gender = "non-gendered person"

	# -----	output ----- #
	print("  ")
	print("  ", "sentence number", totalUtterances, ":", sentence[5:])
	print("  ", "number of words:", numberwords)
	print("  ", "words are:", onlywords)
	print("  ", "speaker ID:", speakerID)
	print("  ", "speaker is a:", gender)
		
# -----	final output : counts outside the loop
#		how does it know to exit? no indentation?

print("  ")
print("  ", "total number of words:", totalWordsSpoken)
print("  ", "total number of utterances:", totalUtterances)
print("  ", "the average number of words per utterance was :",
            totalWordsSpoken / totalUtterances)

print("  ")

returns these results:
EX3_for_loop_output

A problem that I ran into was trying to make the elements in the list strings