Clippers Tuesday: Denis Newman-Griffis on concept embeddings

Word representations are a key technology in the NLP toolbox, but extending their success into representations of phrases and knowledge base entities has proven challenging. In this talk, I will present a method for jointly learning embeddings of words, phrases, and entities from unannotated text, using only a list of mappings between entities and surface forms. I compare these against prior methods that have relied on explicitly annotated text or the rich structure of knowledge graphs, and show that our learned embeddings better capture similarity and relatedness judgments and some relational domain knowledge.

I will also discuss experiments on augmenting the embedding model to learn soft entity disambiguation from contexts, and using member words to augment the learning of phrases. These additions harm model performance on some evaluations, and I will show some preliminary analysis of why the specific modeling approach for these ideas may not be the right one. I hope to brainstorm ideas on how to better model joint phrase-word learning and contextual disambiguation, as part of ongoing work.

Clippers Tuesday: Adam Stiff on Transferring question answering models for virtual patients from written to spoken domain

Virtual patients are an effective, cost-efficient tool for training medical professionals to interview patients in a standardized environment. Technological limitations have thus far restricted these tools to typewritten interactions; however, as speech recognition systems have improved, full-scale deployment of a spoken dialogue system for this purpose has edged into the range of feasibility. To build the best such system possible, we propose to take advantage of work done to improve the functioning of virtual patients in the typewritten domain. Specifically, our approach is to noisily map spoken utterances into text using off-the-shelf speech recognition, whereupon the text can be used to train existing question classification architectures. We expect that phoneme-based CNNs may mitigate recognition errors in the same way that character-based CNNs mitigate, e.g., spelling errors in the typewritten domain. In this talk I will present the architecture of the system being developed to collect speech data, the experimental design, and some baseline results.

Clippers Today: David King on lexical paraphrasing

Automatic paraphrasing with lexical substitution

Generating automatic paraphrases with lexical substitution is a difficult task, but can be useful to supplement data in domain-specific machine learning tasks. The Virtual Patient Project is a clear example of this problem, where we have limited domain-specific training data but need to accurately identify a user’s intended question, an example of which we may have only seen once. In this talk, I will present the progress Amad Hussein, Michael White, and I have made in automatically generating paraphrases, using unsupervised lexical substitution with WordNet, word embeddings, and the Paraphrase Database. Although currently our oracle accuracy in automatically classifying question types is only moderately above our baseline, the gains are modestly significant and give an estimate of what can be accomplished with human filtering. We propose future work in this direction that utilizes machine translation and phrase-level substitution.

Clippers Tuesday: Deblin Bagchi on mimic loss for robust ASR

For the task of speech enhancement, local learning objectives are agnostic to phonetic structures helpful for speech recognition. We propose to add a global criterion to speech enhancement that allows the model to learn these high-level abstractions. We first train a spectral classifier on clean speech to predict senone labels. Then, the spectral classifier is joined with our speech enhancer as a noisy speech recognizer. This model is taught to mimic the output of the spectral classifier alone on clean speech. This mimic loss is combined with the traditional local criterion to train the speech enhancer to produce de-noised speech. Feeding the de-noised speech to an off-the-shelf Kaldi training recipe for the CHiME-2 corpus shows significant improvements in Word Error Rate (WER).
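As a rough illustration of how a mimic term can be combined with a local criterion, here is a minimal sketch. The abstract does not specify the distance measure or the weighting, so the use of mean squared error for both terms, the `mimic_weight` parameter, and the toy `classifier` function are all assumptions made for this example rather than details of the actual system.

```python
def mean_squared_error(a, b):
    """Average squared difference between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def combined_loss(enhanced, clean, classifier, mimic_weight=1.0):
    """Local criterion plus a weighted mimic term (assumed form).

    enhanced: enhancer output for a noisy frame
    clean: the matching clean-speech frame
    classifier: a fixed function mapping a frame to output scores
                (stands in for the classifier trained on clean speech)
    """
    # Local criterion: how close the enhanced frame is to the clean frame.
    local = mean_squared_error(enhanced, clean)
    # Mimic term: how close the classifier's outputs on the enhanced frame
    # are to its outputs on the clean frame.
    mimic = mean_squared_error(classifier(enhanced), classifier(clean))
    return local + mimic_weight * mimic


def toy_classifier(frame):
    """Hypothetical stand-in for a spectral classifier (two output scores)."""
    return [sum(frame), frame[0] - frame[1]]


loss = combined_loss([1.0, 2.0], [1.0, 4.0], toy_classifier, mimic_weight=0.5)
```

In a real system the classifier would be a trained network held fixed while the enhancer is updated; the sketch only shows how the two terms add up for one frame.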

Clippers Tuesday: Micha Elsner on Saccadic Models for Referring Expression Generation

“Saccadic models for referring expression generation”

Referring expression generation (REG) is the task of describing an object in a scene so that an observer can pick it out. We have many experimental results showing that REG is constrained by the sequential nature of human vision (that is, the human eye cannot take in the whole image at once, but must look from place to place— saccade— to see more parts of the image clearly). Yet current neural network models for computer vision begin precisely by analyzing the entire image at once; thus, they cannot be used directly as models of the human REG algorithm. A recent model for computer vision (Mnih et al 2014) has a limited field of vision and makes saccades around the image; I propose to adapt this model to the REG task and use it as a psycholinguistic model of human processing. I will present some background literature, a pilot model architecture and results on some contrived tasks with synthetic data. I will discuss possible ways forward for the model and hope to get some interesting feedback from the group.

Clippers Tuesday: Cory Shain on Incremental Semantics and Reading Time

Title: Evidence of semantic processing difficulty in naturalistic reading

Language is a powerful vehicle for conveying our thoughts to others and inferring thoughts from their utterances. Much research in sentence processing has investigated factors that affect the relative difficulty of processing each incoming word during language comprehension, including in rich naturalistic materials. However, in spite of the fact that language is used to convey and infer meanings, prior research has tended to focus on lexical and/or structural determinants of comprehension difficulty. This focus has plausibly been due to the fact that lexical and syntactic properties can be accurately estimated in an automatic fashion from corpora or using high-accuracy automatic incremental parsers. Comparable incremental semantic parsers are currently lacking. However, recent work in machine learning has found that distributed representations of word meanings — based on patterns of lexical co-occurrence — contain a substantial amount of semantic information, and predict human behavior on a wide range of semantic tasks. To examine the effects of semantic relationships among words on comprehension difficulty, we estimated a novel measure — incremental semantic relatedness — for three naturalistic reading time corpora: Dundee, UCL, and Natural Stories. In particular, we embedded all three corpora using GloVe vectors pretrained on the 840B word Common Crawl dataset, then computed the mean vector distance between the current word and all content words preceding it in the sentence. This provides a measure of a word’s semantic relatedness to the words that precede it without requiring the construction of carefully normed stimuli, permitting us to evaluate semantic relatedness as a predictor of comprehension difficulty in a broad-coverage setting. We found a significant positive effect of mean cosine distance on reading time duration in each corpus, over and above linear (5-gram) and syntactic (PCFG) models of linguistic expectation. 
Our results are consistent with at least two (perhaps complementary) interpretations. Semantically related context might facilitate processing of the target word through spreading activation. Or vector distances might approximate the surprisal values of a semantic component of the human language model, thus yielding a rough estimate of semantic surprisal. Future advances in incremental semantic parsing may permit more precise exploration of these possibilities.
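The incremental semantic relatedness measure described above can be sketched as follows. This is an illustrative approximation only: the tiny two-dimensional vectors stand in for pretrained GloVe vectors, the `content_words` set is a hypothetical substitute for whatever content-word criterion the study used, and returning `None` for words with no usable preceding context is an assumption.

```python
import math


def cosine_distance(u, v):
    """1 minus the cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def incremental_relatedness(sentence, vectors, content_words):
    """For each word, the mean cosine distance to preceding content words.

    Returns None for a word that has no vector or no preceding content
    word with a vector (an assumption made for this sketch).
    """
    scores = []
    context = []  # vectors of content words seen so far in the sentence
    for word in sentence:
        if word in vectors and context:
            dists = [cosine_distance(vectors[word], c) for c in context]
            scores.append(sum(dists) / len(dists))
        else:
            scores.append(None)
        if word in content_words and word in vectors:
            context.append(vectors[word])
    return scores


# Toy vectors, not real GloVe values.
toy_vectors = {"doctor": [1.0, 0.0], "nurse": [0.8, 0.6], "banana": [0.0, 1.0]}
toy_content = {"doctor", "called", "nurse", "banana"}

related = incremental_relatedness(
    ["the", "doctor", "called", "the", "nurse"], toy_vectors, toy_content)
unrelated = incremental_relatedness(
    ["the", "doctor", "called", "the", "banana"], toy_vectors, toy_content)
```

With these toy vectors, the final word of the first sentence gets a smaller distance than the final word of the second, which is the kind of contrast the measure is meant to capture.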

Clippers Tuesday: Wuwei Lan on Continuously Growing Sentential Paraphrases

At Clippers on Tuesday, Wuwei Lan will be presenting his EMNLP 2017 paper (with Wei Xu)

Title: Automatic Paraphrase Collection and Identification in Twitter


A paraphrase is a restatement of the meaning of a text or passage using other words, which is helpful in many NLP applications, including machine translation, question answering, semantic parsing and textual similarity. Paraphrase resources are valuable, but they are hard to obtain at large scale, especially for sentence-level paraphrases. Here we propose a way to automatically collect large numbers of sentential paraphrases from Twitter by simply grouping tweets through shared URLs. We present the largest human-labeled gold-standard corpus of 51,524 pairs, as well as a silver-standard corpus that can grow by 30k pairs per month with 70% precision. Based on this paraphrase dataset from Twitter, we experimented with deep learning models for automatic paraphrase identification. We find that, without pretrained word embeddings, we can still achieve state-of-the-art or competitive results on social media datasets with only character or subword embeddings, which is useful in domains with more out-of-vocabulary words or more spelling variations.
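The idea of grouping tweets through shared URLs can be sketched roughly as below. This is a simplified illustration, not the paper's pipeline: the URL regular expression, the removal of URLs from the tweet text, and the example tweets are all assumptions, and the resulting pairs are only candidates that would still need labeling or filtering.

```python
import re
from collections import defaultdict
from itertools import combinations

# Simplified URL matcher (an assumption; real tweets need more care).
URL_PATTERN = re.compile(r"https?://\S+")


def candidate_pairs(tweets):
    """Pair up tweets that link to the same URL.

    Returns a list of (text_a, text_b) tuples with URLs stripped out.
    """
    by_url = defaultdict(list)
    for tweet in tweets:
        urls = set(URL_PATTERN.findall(tweet))
        text = URL_PATTERN.sub("", tweet).strip()
        for url in urls:
            # Skip exact duplicate texts (e.g. retweets) under the same URL.
            if text not in by_url[url]:
                by_url[url].append(text)
    pairs = []
    for url in sorted(by_url):
        pairs.extend(combinations(by_url[url], 2))
    return pairs


example_tweets = [
    "Storm closes bridge http://example.com/a",
    "Bridge shut down due to storm http://example.com/a",
    "New phone released http://example.com/b",
]
pairs = candidate_pairs(example_tweets)
```

Only the two tweets that share a URL are paired; the third tweet has no partner and contributes nothing.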

Clippers Tuesday: Denis Newman-Griffis on second-order word embeddings

At Clippers Tuesday, Denis Newman-Griffis will be presenting his work looking at the topological structure of word embeddings and how that info can (or can’t) be used downstream.


Word embeddings are now one of the most common tools in the NLP toolbox, and we have a good sense of how to train them, tune them, and apply them effectively. However, the structure of how they encode the information used in downstream applications is much less well-understood. In this talk, I present work analyzing nearest-neighborhood topological structures derived from trained word embeddings, discarding absolute feature values and maintaining only the relative organization of points. These structures exhibit several interesting properties, including high variance in the organization of neighborhood graphs derived from embeddings trained on the same corpus with different random initializations. Additionally, I show that graph node embeddings trained over the nearest neighbor graph can be substituted for the original word embeddings in both deep and shallow downstream models for named entity recognition and paraphrase detection, with only a small loss in accuracy and even an increase in recall in some cases. While these graph node embeddings suffer from the same issue of high variance due to random initializations, they exhibit some interesting properties of their own, including generating a higher-density point space, remarkably poor performance on analogy tasks, and preservation of similarity at the expense of relatedness.
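To make the notion of a nearest neighbor graph concrete, here is a minimal sketch of deriving one from a set of embeddings. The abstract does not say which similarity measure or neighborhood size was used, so cosine similarity, the parameter `k`, and the toy vectors are assumptions for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def knn_graph(embeddings, k):
    """Map each word to its k most similar other words.

    Only the neighbor lists are kept, so the absolute feature values
    of the embeddings are discarded.
    """
    graph = {}
    for word, vec in embeddings.items():
        scored = [
            (cosine_similarity(vec, other_vec), other)
            for other, other_vec in embeddings.items()
            if other != word
        ]
        # Highest similarity first; ties broken alphabetically.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        graph[word] = [other for _, other in scored[:k]]
    return graph


# Toy two-dimensional embeddings, not trained values.
toy_embeddings = {
    "cat": [1.0, 0.1],
    "dog": [0.9, 0.2],
    "car": [0.1, 1.0],
    "truck": [0.2, 0.9],
}
graph = knn_graph(toy_embeddings, k=1)
```

Rescaling every vector by the same positive constant leaves this graph unchanged, which is one simple sense in which the graph keeps only the relative organization of the points.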

Clippers Tuesday: Adam Stiff on domain adaptation for question answering systems

At Clippers on Tuesday, Adam Stiff will present on domain adaptation for question answering systems.

Abstract: On Tuesday I’ll be presenting a proposal for a strategy to train new virtual patients, which would ideally allow an educator to instantiate a new patient from one set of question-answer pairs. The idea follows some fairly recent work in one-shot learning, and the aim will be to leverage much larger corpora to try to encourage semantically similar questions to be close together in some representation space, to limit the need for extensive training for a new model. The idea is still evolving, so the talk will be very informal, and I hope to get feedback and suggestions about related research that I should be reading, potential pitfalls, extensions, etc.

Clippers Tuesday: Cory Shain on deconvolutional time series regression

Title: Deconvolutional time series regression: A technique for modeling temporally diffuse effects

Abstract: This talk proposes Deconvolutional Time Series Regression (DTSR), a general-purpose regression technique for modeling sequential data in which effects can reasonably be assumed to be temporally diffuse. DTSR jointly learns linear effect estimates and temporal convolution parameters from parallel temporal sequences of dependent variable(s) and independent variable(s), using the convolution function to assign time-varying weight to the history of each independent variable in computing the prediction for a given regression target. DTSR successfully recovers true latent convolution functions from synthetic data, and on real-world data from several psycholinguistic experiments DTSR both (1) significantly outperforms competing approaches in terms of prediction error on unseen data and (2) provides plausible, fine-grained, and fairly modality-invariant estimates of the time-course of each regressor’s influence on the dependent measure. These results support the superiority of DTSR to standard modeling approaches like linear mixed-effects regression for a range of experiment types.

Authors: Cory Shain and William Schuler
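The core computation described in the abstract, weighting the history of an independent variable by a function of elapsed time, can be sketched as follows. This shows only the prediction step with fixed parameters; in DTSR the effect estimates and convolution parameters are learned jointly, which is not shown. The exponential kernel, the parameter names, and the toy numbers are assumptions chosen for illustration.

```python
import math


def exponential_kernel(delay, rate):
    """Weight given to an event that occurred `delay` time units ago.

    An exponential decay is assumed here purely for illustration.
    """
    return rate * math.exp(-rate * delay)


def convolved_predictor(event_times, event_values, target_time, rate):
    """Sum of past predictor values, each weighted by the kernel."""
    total = 0.0
    for t, x in zip(event_times, event_values):
        if t <= target_time:  # only the history can influence the target
            total += x * exponential_kernel(target_time - t, rate)
    return total


def predict(event_times, event_values, target_time,
            intercept, coefficient, rate):
    """Linear prediction from the convolved predictor."""
    convolved = convolved_predictor(
        event_times, event_values, target_time, rate)
    return intercept + coefficient * convolved


# One predictor with events at times 0.0 and 5.0 (toy values).
times = [0.0, 5.0]
values = [2.0, 1.0]
at_onset = predict(times, values, 0.0, intercept=1.0, coefficient=3.0, rate=1.0)
later = predict(times, values, 1.0, intercept=1.0, coefficient=3.0, rate=1.0)
```

The influence of the first event is largest at its onset and decays afterwards, while the event at time 5.0 contributes nothing to either prediction because it lies in the future of both targets.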