For the task of speech enhancement, local learning objectives are agnostic to phonetic structures helpful for speech recognition. We propose to add a global criterion to speech enhancement that allows the model to learn these high-level abstractions. We first train a spectral classifier on clean speech to predict senone labels. Then, the spectral classifier is joined with our speech enhancer as a noisy speech recognizer. This model is taught to mimic the output of the spectral classifier alone on clean speech. This mimic loss is combined with the traditional local criterion to train the speech enhancer to produce de-noised speech. Feeding the de-noised speech to an off-the-shelf Kaldi training recipe for the CHiME-2 corpus shows significant improvements in Word Error Rate (WER).
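The way the mimic loss is combined with the local criterion can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: it assumes mean squared error for both terms, a hypothetical weighting parameter `alpha`, and a toy linear map (`toy_classifier`) standing in for the frozen spectral classifier trained on clean speech.

```python
import numpy as np


def fidelity_loss(denoised, clean):
    """Local criterion: frame-wise mean squared error (an assumed choice)."""
    return float(np.mean((denoised - clean) ** 2))


def mimic_loss(classifier, denoised, clean):
    """Global criterion: distance between the classifier's outputs on
    de-noised speech and its outputs on the parallel clean speech."""
    return float(np.mean((classifier(denoised) - classifier(clean)) ** 2))


def joint_loss(classifier, denoised, clean, alpha=1.0):
    """Local criterion plus weighted mimic loss; alpha is hypothetical."""
    return fidelity_loss(denoised, clean) + alpha * mimic_loss(
        classifier, denoised, clean
    )


# Toy stand-in for the frozen spectral classifier:
# 40 spectral bins mapped to 10 hypothetical senone classes.
rng = np.random.default_rng(0)
W = rng.standard_normal((40, 10))


def toy_classifier(frames):
    return frames @ W


clean = rng.standard_normal((5, 40))                    # 5 clean frames
denoised = clean + 0.1 * rng.standard_normal((5, 40))   # imperfect enhancer output

print(joint_loss(toy_classifier, denoised, clean, alpha=0.5))
```

In an actual training setup only the enhancer's parameters would be updated from this loss, with the classifier's weights held fixed.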
Month: January 2018
Clippers Tuesday: Micha Elsner on Saccadic Models for Referring Expression Generation
“Saccadic models for referring expression generation”
Referring expression generation (REG) is the task of describing an object in a scene so that an observer can pick it out. Many experimental results show that REG is constrained by the sequential nature of human vision: the human eye cannot take in the whole image at once, but must look from place to place (saccade) to see more parts of the image clearly. Yet current neural network models for computer vision begin precisely by analyzing the entire image at once; thus, they cannot be used directly as models of the human REG algorithm. A recent model for computer vision (Mnih et al. 2014) has a limited field of vision and makes saccades around the image; I propose to adapt this model to the REG task and use it as a psycholinguistic model of human processing. I will present some background literature, a pilot model architecture, and results on some contrived tasks with synthetic data. I will discuss possible ways forward for the model and hope to get some interesting feedback from the group.
Clippers Tuesday: Cory Shain on Incremental Semantics and Reading Time
Title: Evidence of semantic processing difficulty in naturalistic reading
Language is a powerful vehicle for conveying our thoughts to others and inferring thoughts from their utterances. Much research in sentence processing has investigated factors that affect the relative difficulty of processing each incoming word during language comprehension, including in rich naturalistic materials. However, in spite of the fact that language is used to convey and infer meanings, prior research has tended to focus on lexical and/or structural determinants of comprehension difficulty. This focus has plausibly been due to the fact that lexical and syntactic properties can be accurately estimated in an automatic fashion from corpora or using high-accuracy automatic incremental parsers. Comparable incremental semantic parsers are currently lacking. However, recent work in machine learning has found that distributed representations of word meanings (based on patterns of lexical co-occurrence) contain a substantial amount of semantic information, and predict human behavior on a wide range of semantic tasks. To examine the effects of semantic relationships among words on comprehension difficulty, we estimated a novel measure, incremental semantic relatedness, for three naturalistic reading time corpora: Dundee, UCL, and Natural Stories. In particular, we embedded all three corpora using GloVe vectors pretrained on the 840B-token Common Crawl dataset, then computed the mean cosine distance between the current word and all content words preceding it in the sentence. This provides a measure of a word’s semantic relatedness to the words that precede it without requiring the construction of carefully normed stimuli, permitting us to evaluate semantic relatedness as a predictor of comprehension difficulty in a broad-coverage setting. We found a significant positive effect of mean cosine distance on reading times in each corpus, over and above linear (5-gram) and syntactic (PCFG) models of linguistic expectation.
Our results are consistent with at least two (perhaps complementary) interpretations. Semantically related context might facilitate processing of the target word through spreading activation. Or vector distances might approximate the surprisal values of a semantic component of the human language model, thus yielding a rough estimate of semantic surprisal. Future advances in incremental semantic parsing may permit more precise exploration of these possibilities.
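The incremental semantic relatedness measure described in the abstract can be illustrated with a short sketch. This is not the authors' code: the two-dimensional vectors below are made-up stand-ins for pretrained GloVe embeddings, the `function_words` set is a hypothetical way of deciding which words count as content words, and returning `None` for words with no preceding content word (or no embedding) is an assumption about how such cases might be handled.

```python
import numpy as np


def cosine_distance(u, v):
    """1 minus cosine similarity; assumes neither vector is all zeros."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def incremental_relatedness(sentence, vectors, function_words):
    """For each word, the mean cosine distance to all preceding content
    words in the sentence, or None when no such comparison is possible."""
    scores = []
    context = []  # embeddings of the content words seen so far
    for word in sentence:
        vec = vectors.get(word)
        if vec is None or not context:
            scores.append(None)
        else:
            distances = [cosine_distance(vec, c) for c in context]
            scores.append(float(np.mean(distances)))
        # Only content words with an embedding join the context.
        if vec is not None and word not in function_words:
            context.append(vec)
    return scores


# Made-up 2-d vectors standing in for pretrained embeddings.
vectors = {
    "the": np.array([0.5, 0.5]),
    "dog": np.array([1.0, 0.0]),
    "barked": np.array([1.0, 1.0]),
}
function_words = {"the"}

scores = incremental_relatedness(["the", "dog", "barked"], vectors, function_words)
print(scores)
```

Here only "barked" receives a score, because it is the first word preceded by a content word ("dog"). Per-word scores of this kind could then be entered as a predictor in a reading time regression alongside n-gram and PCFG surprisal.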