Exercise 2

(a) Quantifying the data in the Fisher corpus

How many utterances and words do we have in the Fisher corpus on Carmen?

files in Fisher:
wc_Fisher

wcs in Fisher:
065
wc_065_total_words

058
wc_058_total_words

total lines/“utterances” in Fisher (math in Python):
wc_Fisher_total_lines

total words in Fisher (math in Python):
wc_Fisher_total_words

29734 utterances
362020 words


in class
forgot to subtract the time stamps and the word markers. Perhaps the best way to do this would be to multiply the total number of lines by three (three non-words in each line), and then subtract that product from the total number of words.
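
The correction suggested in class can be sketched in Python. This is only a rough sketch: it assumes, as the note does, that every line carries exactly three non-word tokens, and the variable names are my own.

```python
# Totals reported by wc for the whole Fisher corpus (the counts above).
total_lines = 29734
total_tokens = 362020

# Assumption: each line has exactly three non-word tokens
# (time stamps and markers), as suggested in class.
non_words_per_line = 3

corrected_words = total_tokens - non_words_per_line * total_lines
print(corrected_words)  # 272818
```

If some lines carry more or fewer than three non-words, this figure will be off by that amount.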

(b) Do people laugh?

What is the percentage of utterances containing laughter, totaling all the files of the “058” directory?

utterances containing laughter in Fisher (directions from the pdf):
wc_Fisher_utterances(pdf)

from Cygwin:
wc_Fisher_utterances

the number of lines, 1681, is the number of “utterances” with laughter in all of Fisher

utterances in only 058:
wc_058_total_words

utterances with laughter in only 058:
wc_058_laughter
using different commands returned the same number of lines, but different numbers of bytes. Is it the “-R” that changed it?

10556 utterances in 058
199 utterances in 058 containing [laughter]

percentage of utterances containing [laughter] in 058 (math in Python):
wc_058_percent_laughter

approx 1.9% of the utterances in 058 contain laughter.
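
For reference, the percentage can be reproduced from the two counts above. A minimal sketch, with variable names of my own choosing:

```python
utterances_058 = 10556   # all utterances (lines) in directory 058
laughter_058 = 199       # lines in 058 containing [laughter]

percent_laughter = 100 * laughter_058 / utterances_058
print(round(percent_laughter, 1))  # 1.9
```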

Python intro


class notes [pdf]

python tutorial

notes
need to use escape characters for some elements
if you are using a single quote for something other than delimiting the string (an apostrophe, for example), it needs to be escaped with a backslash so that Python reads it as a literal character, not as the end of the string

>>> 'I\'m in graduate school'
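
The same string can also be written without an escape by delimiting it with double quotes. A small sketch to check that the two spellings give the same string (variable names are mine):

```python
escaped = 'I\'m in graduate school'       # backslash escapes the inner single quote
double_quoted = "I'm in graduate school"  # no escape needed inside double quotes

print(escaped)                   # I'm in graduate school
print(escaped == double_quoted)  # True
```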

len() = takes an argument and returns its length (for a string, the number of characters)

>>> len("hello")
5

len = length
int = integer
str = string (by convention, we put a string between quotes – single or double)

type() = returns the data type of its argument, such as int or str

>>> type("hello")
<class 'str'>

the dot “.” connects the string with a method (here split()), which is then applied to that string

>>> "I'm in graduate school".split()
["I'm", 'in', 'graduate', 'school']
>>> len(["I'm", 'in', 'graduate', 'school'])
4
>>> len("I'm in graduate school".split())
4

in lists, the elements are ordered, and Python starts counting at [0]
The number in brackets is called the index, and it points to a specific element within the list

>>> "I'm in graduate school".split()[3]
'school'

here’s an exciting oddity that I accidentally discovered:

>>> "I'm in graduate school".split()[2][3]
'd'

the first index, [2], picks the third element split from the string ('graduate'), and the second index, [3], returns the fourth character within that element, since counting starts at 0
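
Splitting the chained lookup into two steps makes it easier to see what each index does. A minimal sketch, with intermediate names that are my own:

```python
words = "I'm in graduate school".split()

third_word = words[2]        # index 2 is the third element: 'graduate'
fourth_char = third_word[3]  # index 3 is the fourth character: 'd'

print(third_word, fourth_char)     # graduate d
print(words[2][3] == fourth_char)  # True
```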

Ctrl+D (in Terminal) or Ctrl+Z followed by Enter (in cmd) will exit Python and return you to the shell; typing exit() also works. Ctrl+C only interrupts the current command.

Unix counting


class notes [pdf]

more vs. less: there is less difference between them on the Mac, but if you use more you can still see the output in Terminal after you quit, whereas the output disappears if you use less

ls -lh = lists the files in long format with additional metadata (permissions, owner, size in human-readable units, time last modified)

wc shows:

lines  words  bytes  file_name.txt
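
To make the three numbers concrete, here is a rough Python imitation of what wc counts. It is a sketch, not wc itself: the function name and sample text are made up, and wc's own definition of a word can vary slightly with locale.

```python
def wc(text):
    """Return (lines, words, bytes) for a string, roughly as wc reports them."""
    lines = text.count("\n")             # wc counts newline characters
    words = len(text.split())            # whitespace-separated tokens
    n_bytes = len(text.encode("utf-8"))  # bytes, not characters
    return lines, words, n_bytes

sample = "hello there\nI'm in graduate school\n"  # made-up sample text
print(wc(sample))  # (2, 6, 35)
```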

And from the manual:
Terminal_man_wc

| “pipe” – from the pdf :

We use the vertical bar “|” to connect two commands together so that the output from one command becomes the input of the next command. Two (or more) commands connected in this way form what’s called a pipe.

regular expression

xkcd
regular expression games

grep = prints every line in which the element we are looking for appears (grep -c counts those lines instead of printing them)
useful for looking for a word in a file
to get out, type Ctrl+C or Ctrl+D

grep returns the matching lines, and when you conjoin it with “wc” (via | ), wc counts the lines, words, and bytes of the lines that contain the expression you were searching for, NOT the number of instances of the expression itself.
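
The difference between counting matching lines and counting instances can be imitated in Python. This is a sketch with made-up transcript lines, and it uses a plain substring test rather than a real regular expression:

```python
# Made-up lines standing in for a transcript file.
lines = [
    "A: that was funny [laughter] really [laughter]",
    "B: yeah",
    "A: [laughter] okay",
]

pattern = "[laughter]"  # plain substring here, not a regular expression

matching = [line for line in lines if pattern in line]    # what grep prints
line_count = len(matching)                                # wc: lines
word_count = sum(len(line.split()) for line in matching)  # wc: words
instances = sum(line.count(pattern) for line in lines)    # occurrences

print(line_count, word_count, instances)  # 2 10 3
```

Two lines match, but the expression occurs three times, so the line count from wc understates the number of instances whenever a line contains more than one.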

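grep --color=auto "[laughter]" will highlight every character that matches one of those in the brackets. (The brackets have a particular function in regular expressions: they define a set of characters, any one of which can match. To search for the literal text, escape the brackets: "\[laughter\]".)
if you type “grep” with nothing else, you get a usage message summarizing how to call it.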
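
The effect of the brackets can be checked with Python's re module. This is only a sketch with a made-up line; Python and grep differ in some regular expression details, but bracketed character sets behave the same way in both.

```python
import re

line = "yeah [laughter] ok"  # made-up transcript line

# Unescaped brackets form a character class: any one of l, a, u, g, h, t, e, r.
char_class_hits = re.findall(r"[laughter]", line)

# Escaped brackets match the literal text "[laughter]".
literal_hits = re.findall(r"\[laughter\]", line)

print(len(char_class_hits))  # 11 single-character matches
print(literal_hits)          # ['[laughter]']
```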