Title: Evidence of semantic processing difficulty in naturalistic reading
Language is a powerful vehicle for conveying our thoughts to others and inferring thoughts from their utterances. Much research in sentence processing has investigated factors that affect the relative difficulty of processing each incoming word during language comprehension, including in rich naturalistic materials. However, in spite of the fact that language is used to convey and infer meanings, prior research has tended to focus on lexical and/or structural determinants of comprehension difficulty. This focus has plausibly been due to the fact that lexical and syntactic properties can be accurately estimated in an automatic fashion from corpora or using high-accuracy automatic incremental parsers. Comparable incremental semantic parsers are currently lacking. However, recent work in machine learning has found that distributed representations of word meanings — based on patterns of lexical co-occurrence — contain a substantial amount of semantic information, and predict human behavior on a wide range of semantic tasks. To examine the effects of semantic relationships among words on comprehension difficulty, we estimated a novel measure — incremental semantic relatedness — for three naturalistic reading time corpora: Dundee, UCL, and Natural Stories. In particular, we embedded all three corpora using GloVe vectors pretrained on the 840B word Common Crawl dataset, then computed the mean vector distance between the current word and all content words preceding it in the sentence. This provides a measure of a word’s semantic relatedness to the words that precede it without requiring the construction of carefully normed stimuli, permitting us to evaluate semantic relatedness as a predictor of comprehension difficulty in a broad-coverage setting. We found a significant positive effect of mean cosine distance on reading time duration in each corpus, over and above linear (5-gram) and syntactic (PCFG) models of linguistic expectation. 
Our results are consistent with at least two (perhaps complementary) interpretations. Semantically related context might facilitate processing of the target word through spreading activation. Or vector distances might approximate the surprisal values of a semantic component of the human language model, thus yielding a rough estimate of semantic surprisal. Future advances in incremental semantic parsing may permit more precise exploration of these possibilities.
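The incremental semantic relatedness measure described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: it uses toy two-dimensional vectors in place of the pretrained 840B-token GloVe embeddings, and a placeholder content-word filter.

```python
import math

def cosine_distance(u, v):
    # 1 - cosine similarity between two vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def incremental_semantic_relatedness(words, vectors, is_content=lambda w: True):
    """For each word, the mean cosine distance to all preceding content
    words in the sentence (None when there is no usable preceding word)."""
    scores = []
    for i, w in enumerate(words):
        prev = [vectors[p] for p in words[:i] if is_content(p) and p in vectors]
        if w in vectors and prev:
            scores.append(sum(cosine_distance(vectors[w], p) for p in prev) / len(prev))
        else:
            scores.append(None)
    return scores
```

In the full study the same quantity is computed per word token over each corpus and entered as a predictor of reading time alongside the 5-gram and PCFG surprisal baselines.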
At Clippers on Tuesday, Wuwei Lan will be presenting his EMNLP 2017 paper (with Wei Xu)
Title: Automatic Paraphrase Collection and Identification in Twitter
A paraphrase is a restatement of the meaning of a text or passage using other words, and paraphrases are helpful in many NLP applications, including machine translation, question answering, semantic parsing, and textual similarity. Paraphrase resources are valuable and important, but they are hard to obtain at large scale, especially at the sentence level. Here we propose a simple way to automatically collect enormous numbers of sentential paraphrases from Twitter: grouping tweets through shared URLs. We present the largest human-labeled gold corpus of 51,524 pairs, as well as a silver-standard corpus that can grow by 30k pairs per month at 70% precision. Based on this paraphrase dataset from Twitter, we experimented with deep learning models for automatic paraphrase identification. We find that, even without pretrained word embeddings, we can achieve state-of-the-art or highly competitive results on social media datasets using only character or subword embeddings, which is useful in domains with many out-of-vocabulary words or spelling variations.
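The URL-grouping collection strategy can be sketched in a few lines. This is a toy sketch under simplifying assumptions: tweets arrive as (text, URL) pairs, exact duplicates are dropped, and all within-group pairs become paraphrase candidates; the actual pipeline involves Twitter-specific preprocessing and human/silver labeling not shown here.

```python
from collections import defaultdict
from itertools import combinations

def collect_paraphrase_candidates(tweets):
    """tweets: iterable of (text, url) pairs. Group tweets by the URL
    they share and emit all within-group pairs as paraphrase candidates."""
    by_url = defaultdict(list)
    for text, url in tweets:
        if text not in by_url[url]:   # drop exact duplicate tweets
            by_url[url].append(text)
    pairs = []
    for texts in by_url.values():
        pairs.extend(combinations(texts, 2))
    return pairs
```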
At Clippers Tuesday, Denis Newman-Griffis will be presenting his work looking at the topological structure of word embeddings and how that info can (or can’t) be used downstream.
Word embeddings are now one of the most common tools in the NLP toolbox, and we have a good sense of how to train them, tune them, and apply them effectively. However, the structure of how they encode the information used in downstream applications is much less well-understood. In this talk, I present work analyzing nearest neighborhood topological structures derived from trained word embeddings, discarding absolute feature values and maintaining only the relative organization of points. These structures exhibit several interesting properties, including high variance in the organization of neighborhood graphs derived from embeddings trained on the same corpus with different random initializations. Additionally, I show that graph node embeddings trained over the nearest neighbor graph can be substituted for the original word embeddings in both deep and shallow downstream models for named entity recognition and paraphrase detection, with only a small loss to accuracy and even an increase in recall in some cases. While these graph node embeddings suffer from the same issue of high variance due to random initializations, they exhibit some interesting properties of their own, including generating a higher density point space, remarkably poor performance on analogy tasks, and preservation of similarity at the expense of relatedness.
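The first step of the analysis above, extracting the relative neighborhood organization while discarding absolute feature values, can be sketched as building a k-nearest-neighbor graph over the vocabulary. This is an illustrative sketch with toy vectors and brute-force search; the graph node embeddings discussed in the talk would then be trained over a graph like this, which is not shown here.

```python
import math

def knn_graph(vectors, k=2):
    """Keep only each word's k nearest neighbors by cosine similarity,
    discarding the absolute coordinates of the embedding space."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) *
                      math.sqrt(sum(b * b for b in v)))
    graph = {}
    for w, v in vectors.items():
        sims = [(cos(v, u), x) for x, u in vectors.items() if x != w]
        sims.sort(reverse=True)          # most similar first
        graph[w] = [x for _, x in sims[:k]]
    return graph
```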
At Clippers on Tuesday, Adam Stiff will present on domain adaptation for question answering systems.
Abstract: On Tuesday I’ll be presenting a proposal for a strategy to train new virtual patients, which would ideally allow an educator to instantiate a new patient from one set of question-answer pairs. The idea follows some fairly recent work in one-shot learning, and the aim will be to leverage much larger corpora to try to encourage semantically similar questions to be close together in some representation space, to limit the need for extensive training for a new model. The idea is still evolving, so the talk will be very informal, and I hope to get feedback and suggestions about related research that I should be reading, potential pitfalls, extensions, etc.
Title: Deconvolutional time series regression: A technique for modeling temporally diffuse effects
Abstract: This talk proposes Deconvolutional Time Series Regression (DTSR), a general-purpose regression technique for modeling sequential data in which effects can reasonably be assumed to be temporally diffuse. DTSR jointly learns linear effect estimates and temporal convolution parameters from parallel temporal sequences of dependent variable(s) and independent variable(s), using the convolution function to assign time-varying weight to the history of each independent variable in computing the prediction for a given regression target. DTSR successfully recovers true latent convolution functions from synthetic data, and on real-world data from several psycholinguistic experiments DTSR both (1) significantly outperforms competing approaches in terms of prediction error on unseen data and (2) provides plausible, fine-grained, and fairly modality-invariant estimates of the time-course of each regressor’s influence on the dependent measure. These results support the superiority of DTSR to standard modeling approaches like linear mixed-effects regression for a range of experiment types.
Authors: Cory Shain and William Schuler
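The convolutional idea at the core of DTSR can be illustrated with a deliberately simplified sketch: a single predictor, a fixed exponential-decay kernel, and hand-set parameters. The actual model jointly fits the kernel parameters and effect estimates (and supports richer kernel families); this only shows how a convolution function assigns time-varying weight to each event in the regressor's history.

```python
import math

def dtsr_predict(events, targets, beta, lam):
    """events: list of (time, value) for one independent variable.
    The prediction at each target time sums every past event's value,
    weighted by an exponential-decay kernel g(dt) = exp(-lam * dt),
    then scaled by the linear effect estimate beta."""
    preds = []
    for t in targets:
        s = sum(x * math.exp(-lam * (t - te)) for te, x in events if te <= t)
        preds.append(beta * s)
    return preds
```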
At Clippers Tuesday Lifeng Jin will present:
Unsupervised Grammar Induction with Depth-bounded PCFG
There has been recent interest in applying cognitively or empirically motivated bounds on recursion depth to limit the search space of grammar induction models. In this talk, I will present a Bayesian grammar induction model which extends this depth-bounding approach to probabilistic context-free grammar induction (DB-PCFG), which has a smaller parameter space than hierarchic sequence models, and therefore more fully exploits the space reductions of depth-bounding.
Results for this model on grammar acquisition from a synthetic dataset and transcribed child-directed speech exceed those of other models when evaluated on parse accuracy. Moreover, grammars acquired from this model demonstrate a consistent use of category labels, something which has not been demonstrated by other acquisition models.
At Clippers Tuesday, Zhen Wang will present joint work with Huan Sun on separating code from natural language text.
Title: Separating Text and Code for Next Utterance Classification in Stack Overflow
Abstract: In this talk, we will discuss our ongoing work on (1) developing tools to separate natural language text and programming code in a Stack Overflow (SO) comment, and (2) applying them to the Next Utterance Classification (NUC) task. In SO, a comment is posted after a question or answer post, and usually contains much information about follow-up questions, suggestions, opinions, etc. It is often a mixture of two different modalities, natural language and programming language, which distinguishes it from comments on other social media like Twitter and Facebook. This bi-modal mixture makes SO comments more difficult for machines to understand. We hypothesize that separating code and natural text should be the first step for tasks involving understanding programming-related text. While careful comment writers may use special formatting to distinguish natural words from programming tokens, noisy SO comments like “You will first need to: import collections # to use defaultdict” that simply mix text and code together are also very common. Therefore, in our first task, we study automatically separating code and text in noisy SO comments, which we cast as a sequence labeling problem. In our preliminary experiments, we tested a series of baseline models, including a traditional CRF with hand-crafted features and state-of-the-art neural methods for NER. Our results show that for tokens that can appear in both programming and natural language contexts, such as “exception”, “timeout”, and “flatten”, the baseline models cannot accurately predict the labels. We are trying to improve the baseline models using domain-specific knowledge as well as more advanced neural architectures.
In our second task, we investigate whether separately modeling text and code can help the Next Utterance Classification (NUC) task on SO comments, which is to classify whether one utterance is a response to another. To train, validate, and test models, we design special rules to collect context-response pairs from Stack Overflow comments containing both natural language and code snippets. Siamese networks with tied Bi-LSTMs were implemented for the NUC task, with and without code snippets treated differently from natural text. Beyond the current work, our long-term research plan is to mine the rich resources in Stack Overflow, understand text-code mixed data, and develop programming-related intelligent assistants.
Any suggestions and comments are highly appreciated.
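The sequence labeling cast of the text/code separation task can be illustrated with a crude surface-cue baseline. This toy heuristic is an assumption for illustration only, not one of the CRF or neural models discussed above, and it unsurprisingly misses exactly the ambiguous tokens (like a bare "defaultdict") that motivate learned models.

```python
import re

# Surface cues typical of code tokens: punctuation common in source code,
# plus a couple of exact keyword matches.
CODE_HINTS = re.compile(r"[_.#(){}\[\]=]|^import$|^def$")

def label_tokens(tokens):
    """Toy heuristic baseline: tag each token CODE or TEXT by surface form."""
    return ["CODE" if CODE_HINTS.search(t) else "TEXT" for t in tokens]
```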
At Clippers on Tuesday, Jie Zhao will present work with Huan Sun on product-related question answering. Title and abstract below.
Title: Answer Retrieval on E-commerce Websites via Weakly Supervised Question Reformulation
Abstract: In this seminar, I will talk about our ongoing work on product-related question answering on E-commerce websites, which aims to retrieve answers from a large corpus of answer candidates. Our problem setting differs from traditional answer selection, where a small answer candidate set is pre-defined and state-of-the-art models generally adopt sophisticated architectures to match the semantics of QA pairs. Such methods become very expensive when the answer candidate set is large and dynamically growing. In our work, we adopt a classic, lightweight TF-IDF search scheme for efficiency reasons, but aim for better retrieval results through question reformulation. One of the challenges here is the lack of directly labeled pairs for reformulation. To address this, we use the word-matching results of existing QA pairs as weak supervision signals, and define sub-tasks that 1) learn to focus attention on the question words, 2) infer words that are likely to occur in a true answer, and 3) use the results of the first two sub-tasks as a reformulated question to improve the final retrieval performance. We model the inter-relations among these sub-tasks and train them under a multi-task learning scheme. Preliminary results show our model has the potential to achieve better retrieval performance than existing baseline methods while keeping search complexity low. Currently, our model still does not perform very well on the second sub-task, possibly because of the large vocabulary space. We are exploring various learning strategies to further improve it. Any suggestions and comments will be appreciated.
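The lightweight retrieval backbone of the approach above can be sketched as plain TF-IDF scoring over tokenized answer candidates; the query tokens passed in would be the output of the (learned) reformulation step, which is not shown. This is an illustrative sketch with a standard smoothed IDF, not the exact weighting used in the work.

```python
import math
from collections import Counter

def tfidf_retrieve(query_tokens, docs):
    """docs: list of token lists. Score each candidate by TF-IDF overlap
    with the (possibly reformulated) query; return indices best-first."""
    n = len(docs)
    df = Counter()                       # document frequency per term
    for d in docs:
        df.update(set(d))
    def score(d):
        tf = Counter(d)
        return sum(tf[w] * math.log((n + 1) / (df[w] + 1)) for w in query_tokens)
    return sorted(range(n), key=lambda i: score(docs[i]), reverse=True)
```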
At Clippers Tuesday, Evan Jaffe will be presenting work in progress using Sequential Matching Networks to do dialogue response selection.
The SMN architecture is designed to maintain dialogue history (using an RNN) and thus provide extended context. The task is formulated as ranking a set of k candidate responses, given a dialogue history. Preliminary results on a virtual patient dataset show good ranking accuracy (95% on dev) when the network chooses between the true next response and 9 randomly selected negative examples. However, this task may be too easy, so a few more challenging tests are worth exploring, including increasing the size of k and choosing more confusable candidates. An n-gram overlap measure could serve as a good baseline. Ultimately, using the SMN to rerank an n-best list coming from a CNN model (Jin et al. 2017) could prove beneficial, complementing the CNN with an ability to track previous turns. This history could be useful for questions with zero anaphora like ‘What dose’, which crucially rely on previous turns for successful interpretation.
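The n-gram overlap baseline mentioned above might look like the following sketch, assuming Jaccard overlap of bigram sets between the dialogue context and each candidate; the actual choice of n and similarity measure is open.

```python
def ngram_overlap(a_tokens, b_tokens, n=2):
    """Jaccard overlap between the n-gram sets of two token sequences."""
    def grams(toks):
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    ga, gb = grams(a_tokens), grams(b_tokens)
    if not ga and not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)

def rank_candidates(context_tokens, candidates, n=2):
    """Rank candidate responses by n-gram overlap with the context."""
    return sorted(candidates,
                  key=lambda c: ngram_overlap(context_tokens, c, n),
                  reverse=True)
```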
At Clippers on Tuesday, Symon Stevens Guille will be presenting joint work with Taylor Mahler on ethics in NLP; abstract below.
I will present the beginnings of a research project between myself and Taylor Mahler on ethical NLP and data management. I’ll discuss results from several recent papers in NLP, particularly on dialects and sociolinguistic aspects of language use. I will also review the results of Mahler et al. (to appear), which illustrated several ways of fooling NLP systems into sentiment predictions that erroneously contradict human-assigned labels. These results are complemented by case studies from media and civil rights investigations into the use, abuse, and (largely naive) processing of social media data by third parties, particularly the State.