Automatic paraphrasing with lexical substitution
Generating automatic paraphrases with lexical substitution is a difficult task, but it can be useful for supplementing data in domain-specific machine learning tasks. The Virtual Patient Project is a clear example of this problem: we have limited domain-specific training data but need to accurately identify a user’s intended question, an example of which we may have seen only once. In this talk, I will present the progress Amad Hussein, Michael White, and I have made in automatically generating paraphrases using unsupervised lexical substitution with WordNet, word embeddings, and the Paraphrase Database. Although our oracle accuracy in automatically classifying question types is currently only moderately above our baseline, the gains are modestly significant and give an estimate of what can be accomplished with human filtering. We propose future work in this direction that uses machine translation and phrase-level substitution.
For the task of speech enhancement, local learning objectives are agnostic to phonetic structures helpful for speech recognition. We propose to add a global criterion to speech enhancement that allows the model to learn these high-level abstractions. We first train a spectral classifier on clean speech to predict senone labels. Then, the spectral classifier is joined with our speech enhancer as a noisy speech recognizer. This model is taught to mimic the output of the spectral classifier alone on clean speech. This mimic loss is combined with the traditional local criterion to train the speech enhancer to produce de-noised speech. Feeding the de-noised speech to an off-the-shelf Kaldi training recipe for the CHiME-2 corpus shows significant improvements in Word Error Rate (WER).
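The way the mimic loss combines with the local criterion can be sketched in a few lines. This is only an illustration under assumptions: the abstract does not specify the loss functions or their weighting, so mean squared error is used for both terms, `mimic_weight` is a hypothetical hyperparameter, and `toy_classifier` is a made-up stand-in for the frozen spectral classifier trained on clean speech.

```python
from typing import Callable, List

Vector = List[float]


def mean_squared_error(a: Vector, b: Vector) -> float:
    """Average squared difference between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def combined_loss(
    enhanced: Vector,
    clean: Vector,
    classifier: Callable[[Vector], Vector],
    mimic_weight: float = 1.0,  # assumption: the weighting is not given in the abstract
) -> float:
    """Local criterion plus mimic loss for one frame of spectral features."""
    # Local criterion: the enhanced frame should match the clean frame.
    local = mean_squared_error(enhanced, clean)
    # Mimic loss: the classifier's output on the enhanced frame should match
    # its output on the clean frame.
    mimic = mean_squared_error(classifier(enhanced), classifier(clean))
    return local + mimic_weight * mimic


def toy_classifier(frame: Vector) -> Vector:
    """Hypothetical frozen classifier: two fixed linear scores."""
    return [frame[0] + frame[1], frame[0] - frame[1]]


if __name__ == "__main__":
    clean_frame = [1.0, 2.0]
    enhanced_frame = [2.0, 2.0]
    print(combined_loss(enhanced_frame, clean_frame, toy_classifier, mimic_weight=0.5))
```

In a real system the classifier would stay fixed while gradients of this combined loss update only the enhancer.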
“Saccadic models for referring expression generation”
Referring expression generation (REG) is the task of describing an object in a scene so that an observer can pick it out. We have many experimental results showing that REG is constrained by the sequential nature of human vision (that is, the human eye cannot take in the whole image at once, but must look from place to place— saccade— to see more parts of the image clearly). Yet current neural network models for computer vision begin precisely by analyzing the entire image at once; thus, they cannot be used directly as models of the human REG algorithm. A recent model for computer vision (Mnih et al 2014) has a limited field of vision and makes saccades around the image; I propose to adapt this model to the REG task and use it as a psycholinguistic model of human processing. I will present some background literature, a pilot model architecture and results on some contrived tasks with synthetic data. I will discuss possible ways forward for the model and hope to get some interesting feedback from the group.
Title: Evidence of semantic processing difficulty in naturalistic reading
Language is a powerful vehicle for conveying our thoughts to others and inferring thoughts from their utterances. Much research in sentence processing has investigated factors that affect the relative difficulty of processing each incoming word during language comprehension, including in rich naturalistic materials. However, in spite of the fact that language is used to convey and infer meanings, prior research has tended to focus on lexical and/or structural determinants of comprehension difficulty. This focus has plausibly been due to the fact that lexical and syntactic properties can be accurately estimated in an automatic fashion from corpora or using high-accuracy automatic incremental parsers. Comparable incremental semantic parsers are currently lacking. However, recent work in machine learning has found that distributed representations of word meanings — based on patterns of lexical co-occurrence — contain a substantial amount of semantic information, and predict human behavior on a wide range of semantic tasks. To examine the effects of semantic relationships among words on comprehension difficulty, we estimated a novel measure — incremental semantic relatedness — for three naturalistic reading time corpora: Dundee, UCL, and Natural Stories. In particular, we embedded all three corpora using GloVe vectors pretrained on the 840B word Common Crawl dataset, then computed the mean vector distance between the current word and all content words preceding it in the sentence. This provides a measure of a word’s semantic relatedness to the words that precede it without requiring the construction of carefully normed stimuli, permitting us to evaluate semantic relatedness as a predictor of comprehension difficulty in a broad-coverage setting. We found a significant positive effect of mean cosine distance on reading time duration in each corpus, over and above linear (5-gram) and syntactic (PCFG) models of linguistic expectation. 
Our results are consistent with at least two (perhaps complementary) interpretations. Semantically related context might facilitate processing of the target word through spreading activation. Or vector distances might approximate the surprisal values of a semantic component of the human language model, thus yielding a rough estimate of semantic surprisal. Future advances in incremental semantic parsing may permit more precise exploration of these possibilities.
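The incremental semantic relatedness measure described above can be sketched as follows. This is a minimal illustration, not the authors' code: the two-dimensional vectors are toy values rather than GloVe embeddings, the `function_words` set is a hypothetical stand-in for whatever content-word filter was used, and the choice to return `None` when there is no preceding content word is an assumption.

```python
import math
from typing import Dict, List, Optional, Sequence, Set


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """One minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def incremental_relatedness(
    sentence: Sequence[str],
    vectors: Dict[str, Sequence[float]],
    function_words: Set[str],
) -> List[Optional[float]]:
    """Mean cosine distance from each word to the preceding content words."""
    scores: List[Optional[float]] = []
    context: List[Sequence[float]] = []  # vectors of preceding content words
    for word in sentence:
        vec = vectors.get(word)
        if vec is None or not context:
            # No vector, or no preceding content word in this sentence.
            scores.append(None)
        else:
            total = sum(cosine_distance(vec, c) for c in context)
            scores.append(total / len(context))
        # Only content words with a known vector join the context.
        if vec is not None and word not in function_words:
            context.append(vec)
    return scores


# Toy vectors, purely illustrative.
toy_vectors = {
    "the": [1.0, 1.0],
    "dog": [1.0, 0.0],
    "chased": [0.0, 1.0],
    "cat": [1.0, 0.0],
}

if __name__ == "__main__":
    print(incremental_relatedness(["the", "dog", "chased", "cat"], toy_vectors, {"the"}))
```

The resulting per-word scores could then be entered as a predictor in a reading time regression alongside the n-gram and PCFG surprisal baselines.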
At Clippers on Tuesday, Wuwei Lan will be presenting his EMNLP 2017 paper (with Wei Xu).
Title: Automatic Paraphrase Collection and Identification in Twitter
A paraphrase is a restatement of the meaning of a text or passage using other words, and paraphrases are helpful in many NLP applications, including machine translation, question answering, semantic parsing and textual similarity. Paraphrase resources are valuable and important, but they are hard to obtain at large scale, especially for sentence-level paraphrases. Here we propose a simple way to automatically collect large numbers of sentential paraphrases from Twitter: grouping tweets through shared URLs. We present the largest human-labeled gold-standard corpus, with 51,524 pairs, as well as a silver-standard corpus that can grow by 30k pairs per month with 70% precision. Based on this paraphrase dataset from Twitter, we experimented with deep learning models for automatic paraphrase identification. We find that even without pretrained word embeddings, we can achieve state-of-the-art or competitive results on social media datasets using only character or subword embeddings, which is useful in domains with more out-of-vocabulary words or more spelling variations.
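The idea of grouping tweets through shared URLs can be sketched roughly as below. This is an illustrative assumption of how such a collection step might look, not the paper's pipeline: the regular expression, the function name `candidate_pairs`, and the example tweets and URLs are all hypothetical, and the resulting pairs are only candidates that would still need labeling or filtering.

```python
import re
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Tuple

# Simplistic URL matcher, adequate for this toy example only.
URL_PATTERN = re.compile(r"https?://\S+")


def candidate_pairs(tweets: List[str]) -> List[Tuple[str, str]]:
    """Pair up the texts of tweets that link to the same URL."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for tweet in tweets:
        urls = set(URL_PATTERN.findall(tweet))
        text = URL_PATTERN.sub("", tweet).strip()  # keep only the wording
        for url in urls:
            groups[url].append(text)

    pairs: List[Tuple[str, str]] = []
    for url in sorted(groups):
        for first, second in combinations(groups[url], 2):
            if first != second:  # identical texts are not useful paraphrases
                pairs.append((first, second))
    return pairs


if __name__ == "__main__":
    example_tweets = [
        "Storm hits coast http://example.com/a",
        "Coastal towns battered by storm http://example.com/a",
        "New phone released today http://example.com/b",
    ]
    for pair in candidate_pairs(example_tweets):
        print(pair)
```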
At Clippers Tuesday, Denis Newman-Griffis will be presenting his work looking at the topological structure of word embeddings and how that info can (or can’t) be used downstream.
Word embeddings are now one of the most common tools in the NLP toolbox, and we have a good sense of how to train them, tune them, and apply them effectively. However, the structure of how they encode the information used in downstream applications is much less well-understood. In this talk, I present work analyzing nearest neighborhood topological structures derived from trained word embeddings, discarding absolute feature values and maintaining only the relative organization of points. These structures exhibit several interesting properties, including high variance in the organization of neighborhood graphs derived from embeddings trained on the same corpus with different random initializations. Additionally, I show that graph node embeddings trained over the nearest neighbor graph can be substituted for the original word embeddings in both deep and shallow downstream models for named entity recognition and paraphrase detection, with only a small loss to accuracy and even an increase in recall in some cases. While these graph node embeddings suffer from the same issue of high variance due to random initializations, they exhibit some interesting properties of their own, including generating a higher density point space, remarkably poor performance on analogy tasks, and preservation of similarity at the expense of relatedness.
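A nearest neighbor graph of the kind described above can be derived from embeddings with a sketch like the following. The abstract does not say which similarity function or neighborhood size was used, so cosine similarity and the parameter `k` are assumptions here, and the four two-dimensional vectors are toy values.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def knn_graph(embeddings: Dict[str, Sequence[float]], k: int) -> Dict[str, List[str]]:
    """Map each word to its k most similar other words."""
    graph: Dict[str, List[str]] = {}
    for word, vec in embeddings.items():
        scored = [
            (cosine_similarity(vec, other_vec), other)
            for other, other_vec in embeddings.items()
            if other != word
        ]
        # Highest similarity first; ties broken alphabetically for determinism.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        graph[word] = [other for _, other in scored[:k]]
    return graph


# Toy embeddings, purely illustrative.
toy_embeddings = {
    "cat": [1.0, 0.1],
    "dog": [0.9, 0.2],
    "car": [0.1, 1.0],
    "truck": [0.2, 0.9],
}

if __name__ == "__main__":
    print(knn_graph(toy_embeddings, k=1))
```

Because the graph keeps only which words are neighbors, rescaling every vector by the same factor leaves it unchanged, which is one sense in which absolute feature values are discarded.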
At Clippers on Tuesday, Adam Stiff will present on domain adaptation for question answering systems.
Abstract: On Tuesday I’ll be presenting a proposal for a strategy to train new virtual patients, which would ideally allow an educator to instantiate a new patient from one set of question-answer pairs. The idea follows some fairly recent work in one-shot learning, and the aim will be to leverage much larger corpora to try to encourage semantically similar questions to be close together in some representation space, to limit the need for extensive training for a new model. The idea is still evolving, so the talk will be very informal, and I hope to get feedback and suggestions about related research that I should be reading, potential pitfalls, extensions, etc.
Title: Deconvolutional time series regression: A technique for modeling temporally diffuse effects
Abstract: This talk proposes Deconvolutional Time Series Regression (DTSR), a general-purpose regression technique for modeling sequential data in which effects can reasonably be assumed to be temporally diffuse. DTSR jointly learns linear effect estimates and temporal convolution parameters from parallel temporal sequences of dependent variable(s) and independent variable(s), using the convolution function to assign time-varying weight to the history of each independent variable in computing the prediction for a given regression target. DTSR successfully recovers true latent convolution functions from synthetic data, and on real-world data from several psycholinguistic experiments DTSR both (1) significantly outperforms competing approaches in terms of prediction error on unseen data and (2) provides plausible, fine-grained, and fairly modality-invariant estimates of the time-course of each regressor’s influence on the dependent measure. These results support the superiority of DTSR to standard modeling approaches like linear mixed-effects regression for a range of experiment types.
Authors: Cory Shain and William Schuler
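The core computation in the DTSR abstract, weighting the history of a predictor by a convolution function of elapsed time, can be sketched as follows. This is an assumption-laden illustration rather than the authors' implementation: the exponential kernel, the single predictor, and the parameter names `rate`, `beta`, and `intercept` are chosen here for simplicity, and the parameters are fixed by hand instead of being learned jointly.

```python
import math
from typing import Sequence


def exponential_kernel(delay: float, rate: float) -> float:
    """Assumed kernel: weight decays exponentially with time since the event."""
    return rate * math.exp(-rate * delay)


def convolve_history(
    event_times: Sequence[float],
    event_values: Sequence[float],
    target_time: float,
    rate: float,
) -> float:
    """Kernel-weighted sum of all predictor values at or before target_time."""
    total = 0.0
    for time, value in zip(event_times, event_values):
        if time <= target_time:  # only the past can influence the response
            total += exponential_kernel(target_time - time, rate) * value
    return total


def predict(
    event_times: Sequence[float],
    event_values: Sequence[float],
    target_time: float,
    intercept: float,
    beta: float,
    rate: float,
) -> float:
    """Linear prediction from the convolved predictor."""
    convolved = convolve_history(event_times, event_values, target_time, rate)
    return intercept + beta * convolved


if __name__ == "__main__":
    # Two words presented at 0.0 s and 1.0 s with predictor values 2.0 and 3.0.
    print(predict([0.0, 1.0], [2.0, 3.0], target_time=1.0,
                  intercept=1.0, beta=2.0, rate=1.0))
```

In the full technique, `beta` and the kernel parameters would be estimated together from the data, which is what lets the model recover the time course of each effect.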
At Clippers Tuesday Lifeng Jin will present:
Unsupervised Grammar Induction with Depth-bounded PCFG
There has been recent interest in applying cognitively or empirically motivated bounds on recursion depth to limit the search space of grammar induction models. In this talk, I will present a Bayesian grammar induction model which extends this depth-bounding approach to probabilistic context-free grammar induction (DB-PCFG), which has a smaller parameter space than hierarchic sequence models, and therefore more fully exploits the space reductions of depth-bounding.
Results for this model on grammar acquisition from a synthetic dataset and transcribed child-directed speech exceed those of other models when evaluated on parse accuracy. Moreover, grammars acquired from this model demonstrate a consistent use of category labels, something which has not been demonstrated by other acquisition models.
At Clippers Tuesday, Zhen Wang will present joint work with Huan Sun on separating code from natural language text.
Title: Separating Text and Code for Next Utterance Classification in Stack Overflow
Abstract: In this talk, we will discuss our ongoing work on (1) developing tools to separate natural language text and programming code in a Stack Overflow (SO) comment, and (2) applying them to the Next Utterance Classification (NUC) task. In SO, a comment is posted after a question or answer post, and usually contains much information about follow-up questions, suggestions, opinions, etc. It is often a mixture of two different modalities, natural language and programming language, which distinguishes it from comments on other social media such as Twitter and Facebook. This bimodal mixture makes it more difficult for machines to understand. We hypothesize that separating code and natural text should be the first step for tasks involving understanding programming-related text. While careful comment writers may use special formatting to distinguish natural words and programming tokens, noisy SO comments like “You will first need to: import collections # to use defaultdict” that simply mix text and code together are also very common. Therefore, in our first task, we study automatically separating code and text in noisy SO comments, which we cast as a sequence labeling problem. In our preliminary experiments, we tested a series of baseline models, including a traditional CRF with hand-crafted features and state-of-the-art neural methods for the NER task. Our results show that for tokens that can appear in both programming and natural language contexts, such as “exception”, “timeout”, and “flatten”, the baseline models cannot make accurate predictions of their labels. We are trying to improve the baseline models using domain-specific knowledge as well as more advanced neural architectures.
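To make the sequence labeling framing concrete, the sketch below assigns one label per token and scores a deliberately naive tagger. Everything here is a hypothetical illustration: the `TEXT`/`CODE` label set, the gold labels for the example comment, the second example sentence, and the context-free lexicon tagger are assumptions made for this sketch, and the tagger is not one of the CRF or neural baselines from the talk. It only shows why a token like “timeout” cannot be labeled correctly without context.

```python
from typing import List, Set


def lexicon_tagger(tokens: List[str], code_lexicon: Set[str]) -> List[str]:
    """Context-free tagger: a token is CODE if it appears in the lexicon."""
    return ["CODE" if token.lower() in code_lexicon else "TEXT" for token in tokens]


def token_accuracy(predicted: List[str], gold: List[str]) -> float:
    """Fraction of tokens whose predicted label matches the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("label sequences must have the same length")
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)


# Hypothetical lexicon of tokens that often occur in code.
CODE_LEXICON = {"import", "collections", "defaultdict", "timeout"}

# The noisy comment quoted in the abstract, whitespace-tokenized, with
# illustrative gold labels assumed for this sketch.
comment_tokens = "You will first need to: import collections # to use defaultdict".split()
comment_gold = ["TEXT"] * 5 + ["CODE", "CODE"] + ["TEXT"] * 3 + ["CODE"]

# A made-up sentence in which "timeout" is ordinary English.
prose_tokens = "the request hit a timeout".split()
prose_gold = ["TEXT"] * 5

if __name__ == "__main__":
    print(token_accuracy(lexicon_tagger(comment_tokens, CODE_LEXICON), comment_gold))
    print(token_accuracy(lexicon_tagger(prose_tokens, CODE_LEXICON), prose_gold))
```

The lookup labels “timeout” as code wherever it appears, so it is wrong whenever the word is used as plain English; a sequence model can in principle use the surrounding tokens to decide.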
In our second task, we investigate whether separately modeling text and code can help the Next Utterance Classification (NUC) task on SO comments, which is to classify whether an utterance is a response to another. For training, validating, and testing models, we design special rules to collect context-response pairs from Stack Overflow comments containing both natural language and code snippets. Siamese networks with tied Bi-LSTMs were implemented for the NUC task, with and without code snippets treated differently from natural text. Beyond the current work, our research plan is to mine the rich resources in Stack Overflow, understand text-code mixed data, and, in the long run, develop programming-related intelligent assistants.
Any suggestions and comments are highly appreciated.