Subcorpora
The Ohio State Stories Corpus includes two subcorpora:
- The Columbus subcorpus was recorded in Columbus, Ohio, in the Midland dialect region, shown in blue on the map below.
- The Ann Arbor subcorpus was recorded in Ann Arbor, Michigan, in the Northern dialect region, shown in green on the map below.
Talkers
The Columbus subcorpus includes 30 young adult talkers: 15 lifetime residents of the Midland dialect region (5 male, 10 female) and 15 lifetime residents of the Northern dialect region (5 male, 10 female).
The Ann Arbor subcorpus includes 15 young adult talkers, all lifetime residents of the Northern dialect region (5 male, 10 female).
All 45 talkers are monolingual native speakers of American English who ranged in age from 18 to 29 years.
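The talker counts above can be checked with a short tally. This is an illustrative sketch only: the `TalkerGroup` record, its field names, and the `count` helper are assumptions made for this example and are not part of the corpus distribution.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TalkerGroup:
    """One cell of the talker sample (hypothetical record, not a corpus file format)."""
    subcorpus: str  # recording location
    dialect: str    # dialect region in which the talkers are lifetime residents
    gender: str
    n: int          # number of talkers in this cell


# Counts as stated in the Talkers section.
GROUPS = [
    TalkerGroup("Columbus", "Midland", "male", 5),
    TalkerGroup("Columbus", "Midland", "female", 10),
    TalkerGroup("Columbus", "Northern", "male", 5),
    TalkerGroup("Columbus", "Northern", "female", 10),
    TalkerGroup("Ann Arbor", "Northern", "male", 5),
    TalkerGroup("Ann Arbor", "Northern", "female", 10),
]


def count(groups, **criteria):
    """Sum talkers over the cells whose fields match every given criterion."""
    return sum(
        g.n
        for g in groups
        if all(getattr(g, field) == value for field, value in criteria.items())
    )


print(count(GROUPS))                        # all talkers
print(count(GROUPS, subcorpus="Columbus"))  # Columbus subcorpus
print(count(GROUPS, dialect="Northern"))    # Northern talkers in both subcorpora
```

Tallying by `dialect` rather than `subcorpus` shows that the Northern talkers are split evenly between the two recording locations, which is what makes the recording-location comparison possible.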
Materials
Each talker was recorded reading a set of 30 short stories twice. The stories were modeled on the smaller set of stories developed by Baker and Bradlow (2009). The first reading of the set of stories was produced in a “plain” lab style in which the talkers were instructed to imagine speaking to a friend. The second reading of the set of stories was produced in a “clear” lab style, in which the talkers were instructed to imagine speaking to a hearing-impaired or non-native listener.
Together, the 30 stories include 236 mostly monosyllabic target words that vary orthogonally in lexical frequency and phonological neighborhood density, as defined in the Hoosier Mental Lexicon (Nusbaum et al., 1984). Each target word appears twice in its story, and the cloze probability of each mention of each target word was estimated in a separate cloze probability task (see Burdin et al., 2015).
The target words therefore vary factorially in lexical frequency, phonological neighborhood density, cloze probability, and discourse mention. These factors are fully crossed with speaking style, talker gender, and talker dialect in the Columbus subcorpus and with speaking style, talker gender, and recording location for the Northern talkers in the two subcorpora.
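The size of the design follows from the numbers given above. The sketch below enumerates the talker-level cells and computes the planned number of story readings and target-word tokens. It assumes that every talker produced every story and every target token; the counts are therefore design maxima, not reported corpus totals, and all variable names are chosen for this example.

```python
from itertools import product

# Figures taken from the Talkers and Materials sections.
N_STORIES = 30
N_TARGET_WORDS = 236
MENTIONS = ("first", "second")   # each target word appears twice in its story
STYLES = ("plain", "clear")      # each story set was read once per style
GENDERS = ("male", "female")
N_TALKERS = {"Columbus": 30, "Ann Arbor": 15}

# Talker-level factors crossed within the Columbus subcorpus.
columbus_cells = list(product(STYLES, GENDERS, ("Midland", "Northern")))

# Talker-level factors crossed for the Northern talkers across subcorpora.
northern_cells = list(product(STYLES, GENDERS, ("Columbus", "Ann Arbor")))

total_talkers = sum(N_TALKERS.values())

# One story reading = one talker reading one story in one style.
planned_story_readings = total_talkers * N_STORIES * len(STYLES)

# Target tokens: words x mentions per reading of the set, x styles, x talkers.
tokens_per_set_reading = N_TARGET_WORDS * len(MENTIONS)
tokens_per_talker = tokens_per_set_reading * len(STYLES)
planned_target_tokens = total_talkers * tokens_per_talker

print(len(columbus_cells), len(northern_cells))
print(planned_story_readings, planned_target_tokens)
```

The lexical factors (frequency and neighborhood density) are not enumerated here because the text does not state how many levels each one has.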
The Columbus subcorpus is fully described in: Burdin, R. S., Turnbull, R., & Clopper, C. G. (2015). Interactions among lexical and discourse characteristics in vowel production. Proceedings of Meetings on Acoustics, 22, 060005.
Annotation
The entire corpus has been forced-aligned using the Penn Phonetics Lab Forced Aligner (Yuan & Liberman, 2008). The alignments of the onset and offset of each target vowel have been hand-corrected.
Data Processing and Analysis
The data processing workflow for the corpus, including associated Praat and R scripts and sample data files, is available on the Ohio State Stories (OSS) Corpus Data Processing OSF repository.
Project Team
Principal investigator: Cynthia Clopper
Corpus design: Rory Turnbull, Abby Walker
Corpus collection: Rachel Steindel Burdin, Anna Crabb, McKenna Reeher, Rory Turnbull
Corpus annotation: Rachel Steindel Burdin, Anna Crabb, Megan Dailey, Nathanael Fath, Jessica Hanson, Erin Luthern, Sarah Mabie, Shannon Melvin, Rachel Monnin, Christine Prechtel
References
Baker, R. E., & Bradlow, A. R. (2009). Variability in word duration as a function of probability, speech style, and prosody. Language and Speech, 52, 391-413.
Burdin, R. S., Turnbull, R., & Clopper, C. G. (2015). Interactions among lexical and discourse characteristics in vowel production. Proceedings of Meetings on Acoustics, 22, 060005.
Nusbaum, H. C., Pisoni, D. B., & Davis, C. K. (1984). Sizing up the Hoosier mental lexicon: Measuring the familiarity of 20,000 words. Research on speech perception progress report no. 10 (pp. 357-376). Bloomington, IN: Speech Research Laboratory, Indiana University.
Yuan, J., & Liberman, M. (2008). Speaker identification on the SCOTUS corpus. Proceedings of Acoustics ’08, 5687-5690.
