Homework 1



Homework 1

tasks in the homework:

We want to address the question of whether women talk more than men. To answer this, we will use the Fisher corpus on Carmen.
Write a python script that outputs —

  • the raw total number of words spoken by women
  • the raw total number of words spoken by men
  • the total number of utterances spoken by women
  • the total number of utterance spoken by men
  • the average number of words per utterance spoken by women and by men
  • the number of female speakers
  • the number of male speakers

the previous assignment plus opening the Fisher files with nested ‘for’ loops gives:
HW1_output

which is the output of this loop (so far) written in class:

# ------ importing all directories ----- #
import os

# ----- initialization of tracking variables ----- #
totalWordsSpoken = 0
totalUtterances = 0

wordsW = 0
wordsM = 0
wordsN = 0

utterW = 0
utterM = 0
utterN = 0

# ----- listing top-level directory ----- #
dir = "Fisher"
dirA = os.listdir(dir)

# ----- listing subdirectories ----- #
for dirB in dirA:
    dirC = dir + "/" + dirB
    fileA = os.listdir(dirC)
    
    # ----- opening files ----- #
    for fileB in fileA :
        path = dirC + "/" + fileB
        fileC = open(path)

        # -----	for loop ----- #
        for sentence in fileC:
            
            # -----	processing block ----- #
            words = sentence.split()
            onlywords = words[3:]
            genderLetter = words[2][2]
            speakerID = words[2][0]
            numberwords = len(onlywords)
            onlysen = " ".join(onlywords)
            
            # -----	count ----- #
            totalWordsSpoken += numberwords
            totalUtterances += 1
            
            # -----	gender output ----- #
            if genderLetter == 'f':
                gender = "woman"
                wordsW += numberwords
                utterW += 1
            elif genderLetter == 'm':
                gender = "man"
                wordsM += numberwords
                utterM += 1
            else:
                gender = "non-gendered person"
                wordsN = numberwords
                utterN += 1

            # -----	output per sentence----- #
            print()
            print("  ", "sentence number", totalUtterances, ":", onlysen)
            print("  ", "number of words:", numberwords)
            print("  ", "words are:", onlywords)
            print("  ", "speaker ID:", speakerID)
            print("  ", "speaker is a:", gender)

# -----	final totals output ----- #
print()
print("  ", "total number of words:", totalWordsSpoken)
print("  ", "total number of utterances:", totalUtterances)
print("  ", "the average number of words per utterance was :",
      totalWordsSpoken / totalUtterances)
print()
print("  ", "total words spoken by women:", wordsW)
print("  ", "total number of utterances:", utterW)
print("  ", "the average number of words per utterance was :",
      wordsW / utterW)
print()
print("  ", "total words spoken by men:", wordsM)
print("  ", "total number of utterances:", utterM)
print("  ", "the average number of words per utterance was :",
      wordsM / utterM)
print()
print("  ", "total words unaccounted for by gender:", wordsN)
print("  ", "total number of utterances unaccounted for:", utterN)
print()

need to add some stuff in to count the number of speakers of each gender.

having a bit of trouble with the if/else statements here and the nesting. I’m going to build a code for only one file, extend it to two, and then run it on the entire corpus once I can differentiate between the ‘A-f:’ in one file, and the ‘A-f:’ in the next file in the list. I’m also assuming that there are never any repeated speakers across files.
Here are the counts for fe_03_06500.txt to compare the code against:
(there is 1 female speaker, and 1 male speaker.
grep_speakerID_06500

here is my speaker count abstraction (I’m sure there’s an easier way to do this):

# --- babyHW1.py --- #

# --- instantiate --- #
a = 13 	#w
b = 0	#w
c = 2	#m
d = 9	#m
e = 0	#n

w = 0
m = 0
n = 0

# --- speaker count --- #
if a > 0:
	w += 1
else:
	w = 0

if b > 0:
	w += 1
else:
	w = 0

if c > 0:
	m += 1
else:
	m = 0

if d > 0:
	m += 1
else:
	m = 0

if e > 0:
	n += 1
else:
	n = 0

# --- output --- #
print("w:", w)
print("m:", m)
print("n:", n)

Exercise 4



Exercise 4

In class, we have seen how to read from a file. Here is what the code looks like so far:

#------------initialization of tracking variables-------------------
totalWordsSpoken = 0
totalUtterances = 0

#------------open the file-------------------------

fisherFile = open("Fisher/065/fe_03_06500.txt")

#-----------processing block---------------------

for line in fisherFile:
        #list of the items in the line
        words = line.split()

        print("Here are all the words", words)

        #extracting speaker ID
        speaker = words[2]

        #actual words uttered by the speaker
        actualWords = words[3:]

        print("the sentence is spoken by", speaker)
        print("their actual utterance was", actualWords)
        print("the sentence has", len(actualWords), "words")

        totalWordsSpoken += len(actualWords)
        totalUtterances += 1

#---------done with all the sentences; post-analysis----------

print("the total number of words spoken was", totalWordsSpoken)
print("the total number of utterances was", totalUtterances)
print("the average number of words per utterance was",
      totalWordsSpoken / totalUtterances)

Now we want to keep track of the gender information too. We want the total of words and the total of utterances uttered by women as well as the total of words and the total of utterances uttered by men. Look at the notes and adapt your code to use a “if statement” to do so. Make sure your code runs ;-) The notes give the results.

wrote this code in class (see Python scripting)

# ----- initialization of tracking variables ----- #
totalWordsSpoken = 0
totalUtterances = 0

wordsW = 0
wordsM = 0
wordsN = 0

utterW = 0
utterM = 0
utterN = 0

# ----- opening file ----- #

fF = open("../Downloads/Fisher/065/fe_03_06500.txt")


# -----	for loop ----- #
for sentence in fF:
	
    # -----	processing block ----- #
    words = sentence.split()
    onlywords = words[3:]
    genderLetter = words[2][2]
    speakerID = words[2][0]
    numberwords = len(onlywords)
    onlysen = " ".join(onlywords)
	
    # -----	count ----- #
    totalWordsSpoken += numberwords
    totalUtterances += 1

    # -----	gender output ----- #
    if genderLetter == 'f':
        gender = "woman"
        wordsW += numberwords
        utterW += 1
    elif genderLetter == 'm':
        gender = "man"
        wordsM += numberwords
        utterM += 1
    else:
        gender = "non-gendered person"
        wordsN = numberwords
        utterN += 1

    # -----	output per sentence----- #
    print()
    print("  ", "sentence number", totalUtterances, ":", onlysen)
    print("  ", "number of words:", numberwords)
    print("  ", "words are:", onlywords)
    print("  ", "speaker ID:", speakerID)
    print("  ", "speaker is a:", gender)

# -----	final totals output ----- #

print()
print("  ", "total number of words:", totalWordsSpoken)
print("  ", "total number of utterances:", totalUtterances)
print("  ", "the average number of words per utterance was :",
      totalWordsSpoken / totalUtterances)
print()
print("  ", "total words spoken by women:", wordsW)
print("  ", "total number of utterances:", utterW)
print("  ", "the average number of words per utterance was :",
      wordsW / utterW)
print()
print("  ", "total words spoken by men:", wordsM)
print("  ", "total number of utterances:", utterM)
print("  ", "the average number of words per utterance was :",
      wordsM / utterM)
print()
print("  ", "total words unaccounted for by gender:", wordsN)
print("  ", "total number of utterances unaccounted for:", utterN)
print()

which returns this output on the macs at school (haven’t tried it at home yet)
Screen Shot 2014-05-08 at 11.54.15 AM

Python sentences

class notes [pdf]

Python (if installed properly) always tells the terminal where it is located, so, relative to Python, the program doesn’t need to be in the same directory, or even referenced in relation to it.

Screen Shot 2014-05-07 at 10.32.06 AM

*/ — set up Cygwin on the Dellosaurus to do the same thing, if possible? — /*

First program (in Xcode on mac):
Screen Shot 2014-05-07 at 11.09.15 AM

returns this output in Python 3.3.4 in Terminal:
Screen Shot 2014-05-07 at 11.09.29 AM

because the program is processed in a linear order, we can change the value of “sentence” at any point after where it was initially set
Screen Shot 2014-05-07 at 12.57.14 PM

but if you don’t redefine the other variables, it returns:
Screen Shot 2014-05-07 at 12.37.47 PM

redefining the variables for the second sentence:
Screen Shot 2014-05-07 at 12.53.57 PM

you get the correct, expected output, and the correct number count:
Screen Shot 2014-05-07 at 12.41.31 PM

I wonder if there is a more concise way to do this, or maybe I’m doing it incorrectly?

the for loop
unlike other programming languages, Python uses indentation to define the scope of the loop (not {} like some, etc.)

to start counting, we have to set the total to 0 so that we can mark it as an integer, and have a starting point

the loop runs in the interactive mode in Python:

# -- for loop
for number in [1, 2, 3]:
    number *= 2
    print(number)

# -- processing 1
number = 1
number *= 2
print(number)

# -- processing 2
number = 2
number *= 2
print(number)

# -- processing 3
number = 3
number *= 2
print(number)

in class programs:
sentence.py

#this is a note not seen by Python

# -- first test
# print("   ")
# print("hello world!")

# -- doing it the long way (sentence 1)
# print("   ")
# print("Here are all the words:", "B-f: I'm in graduate school".split())
# print("This sentence has", len("B-f: I'm in graduate school".split()), "words")

# print("   ")
# print("If we remove the identifier at the beginning, then...")
# print("   ")
# print("Here are all the words:", "B-f: I'm in graduate school".split()[1:])
# print("This sentence has", len("B-f: I'm in graduate school".split()[1:]), "words")

# print("   ")
# print("   ", "assigning variables to the things that we've just input ")
# print("   ", "it's a good idea to give a sensible name to your variable")

# -- adding variables (sentence 1)

print("   ")
sentence = "B-f: I'm in graduate school"
words = sentence.split()
onlywords = words[1:]
gender = words[0][2]
speakerID = words[0][0]
numberwords = len(onlywords)

# print("all word-like things:", words)
# print("   ")
# print("the words", onlywords, "were spoken by a", gender, "person")

print("  ", "for the sentence,", sentence)
print("  ")
print("  ", "number of words:", numberwords)
print("  ", "words are:", onlywords)
print("  ", "speaker ID:", speakerID)
print("  ", "speaker is:", gender)

# print("   ")
# print("  * --", "figure out how to print something based on the input", "-- *")
# print("  * --", "like, return 'Female' for the 'gender' variable 'f'", "-- *")
# print("   ", "")


# -- looking at a second sentence (sentence 2): changing the value of "sentence", and re-running everything
# -- taking the whole thing out of the print (after the fact)

print("   ")
sentence = "B-f: at OSU"
words = sentence.split() # only need to redefine the "words" variable
# -- don't need to redefine the other variables

print("  ", "for the sentence,", sentence)
print("  ")
print("  ", "number of words:", numberwords)
print("  ", "words are:", onlywords)
print("  ", "speaker ID:", speakerID)
print("  ", "speaker is:", gender)

print("  ")

sentence2.py



# -- initialization of tracking variables
totalWordsSpoken = 0
totalUtterances = 0

# -- adding variables (sentence 1)

print("   ")
sentence = "B-f: I'm in graduate school"
words = sentence.split()
onlywords = words[1:]
gender = words[0][2]
speakerID = words[0][0]
numberwords = len(onlywords)

totalWordsSpoken += len(onlywords)
totalUtterances += 1

print("  ", "sentence 1")
print("  ", "number of words:", numberwords)
print("  ", "words are:", onlywords)
print("  ", "speaker ID:", speakerID)
print("  ", "speaker is:", gender)

# -- sentence 2

print("   ")
sentence = "B-f: at OSU"
words = sentence.split() # define these variables again
onlywords = words[1:]
gender = words[0][2]
speakerID = words[0][0]
numberwords = len(onlywords)

totalWordsSpoken += len(onlywords)
totalUtterances += 1

print("  ", "sentence 2")
print("  ", "number of words:", numberwords)
print("  ", "words are:", onlywords)
print("  ", "speaker ID:", speakerID)
print("  ", "speaker is:", gender)

# -- final output


print("  ")
print("  ", "total number of words:", totalWordsSpoken)
print("  ", "total number of utterances:", totalUtterances)
print("  ", "the average number of words per utterance was :",
            totalWordsSpoken / totalUtterances)

print("  ")

Exercise 2

Exercise 2

(a) Quantifying the data in the Fisher corpus

How many utterances and words do we have in the Fisher corpus on Carmen?

files in Fisher:
wc_Fisher

wcs in Fisher:
065
wc_065_total_words

058
wc_058_total_words

total lines/”utterances” in Fisher (math in Python):
wc_Fisher_total_lines

total words in Fisher (math in Python):
wc_Fisher_total_words

29734 utterances
362020 words


in class
forgot to subtract the time-stamps and the markers for the words — perhaps the best way to do this would be to multiply the number of total number lines by three, and then subtract those non-words (three in each line) from the total number of words.

(b) Do people laugh?

What is the percentage of utterances containing laughter, totaling all the files of the “058” directory?

utterances containing laughter in Fisher (directions from the pdf):
wc_Fisher_utterances(pdf)

from Cygwin:
wc_Fisher_utterances

the number of lines 1681 is the number of “utterances” with laughter in all of Fisher

utterances in only 058:
wc_058_total_words

utterances with laughter in only 058:
wc_058_laughter
using different commands returned the same number of lines, but different numbers of bytes. Is it the “-R” that changed it?

10556 utterances in 058
199 utterances in 058 containing [laughter]

percentage of utterance containing [laughter] in 058 (math in Python):
wc_058_percent_laughter

approx 1.9% of the utterances in 058 contain laughter.

hackarooni



this is a changelog to figure out how to keep the blog hacked.

So, here is the stylesheet I’ve introduced (mega-basic)

<head>
	<style>
		body { font-family: Lucida Console,Monaco,monospace; }
		a { text-decoration: none; }
		pre, blockquote { font-size: 13px; }
	</style>
</head>

It overrides a few elements, but still needs <blockuote> to be styled inline.
It also doesn’t extend font-size to the footer, or text-decoration to the sidebar links (although the calendar does lose it’s underline)

I tried to put the whole thing on one line, attempting to remove the giant gap at the top, but it doesn’t work. the <head> tag is the culprit. I think I’ll need to adjust the margins, but I should look into the original stylesheet first. OR… hack away!

added:

margin-top { font-size: -10px; }

It seemed to work, but not enough. Changed to -100px. Let’s see. Nuthin’. Moved the margin-top within the body identifier. And?? not much. -300px? Oh my! I chopped off the whole top of the blog! Check it out:
Screenshot

let’s try padding instead. (I can’t remember, is that only for HTML tables?). Nuthin’. How do I hide the <head> tag then? This is a job for later. Other things to do. Taking the whole margin thing out.

Or maybe wrap the whole thing in a <div>? As an idea, and move it up… but since <head> is technically inside the &tl;body>, that makes things odd.

HERE is the global stylesheet, so I know where to start hijacking… begin with the GIgantic h1 and h2 perhaps?

Also, later, will want to explore:

  • if it’s possible to have the entire semester on one blog
  • how to export it to WordPress after?
  • post blog posts inside the pages (these two things are linked)
  • which other of the HTML tags does it acknowledge?
  • how to keep it from changing ... into …
    • without having to put it into <code> tags
    • what does <pre> look like?
      ...

      kinda what I thought.

the following is a test of the “close tags” feature:

  • will it do the whole list?
  • let us see…
  • NOPE – didn’t do a thing with all my open <li>s

how to take a screenshot

make sure when adding images to the site, the size is set to “full-size”, and the alignment is “none”. If the alignment is set at all, the words will wrap, and it will be impossible to start a new paragraph just below. But, if wrapping is what you’re after… go for it.

earlier ideas

  • it may be necessary to add this to the top of each post to get the font to stay monospaced, but stay tuned. early days.
  • (I put it in this post only, but it seems to hijack the whole <body> tag, sitewide.)
  • It may be that it only works on the “home” page — need to test that out.

  • Update: Dang. ’tis true. I need to have one post in each tag, category, and on each page to keep it consistent across the site. (added it to the “Unix navigation” page).
  • Oooh, maybe I can add every tag to “hackarooni”… but it might be ugly. humph.
    Update2: nope – need to put it in every post, so that if you click on just the link to that post, it’ll stay formatted. Rrrrrr.

  • add to the top of each post (wish I could <link> a stylesheet!)
    — what if I tried to hijack the <head> tag also? hmmmm. —

    <body style="font-family:Lucida Console,Monaco,monospace;">
    
  • also, an improved font size for the preformatted and blockquote texts
    and a cleanup for links that need such

    <pre style="font-size:13px;">
    <blockquote style="font-size:13px;">
    <a style="text-decoration:none;">
    

and here, HERE, friends, we will test the special characters *crosses fingers* c’moooonnnn!!

first, outside the preformatting : ə æ g͡bɛ k͡pɪŋɑ

and then, within :

ə æ g͡bɛ k͡pɪŋɑ

NICE, WordPress!
here, one invoking a special (but still webbish) font — okay, so it’s Times New Roman, but whatever, and 18px font-size:

ə æ g͡bɛ k͡pɪŋɑ