Cory Shain: working memory load

fMRI evidence of working memory load in naturalistic language processing
Abstract: Working memory plays a critical role in prominent theories of human incremental language processing. Because the complete parse cannot be recognized from a partial string, working memory is thought to be used to store and update parse fragments. Although constructed-stimulus experiments have produced evidence for this hypothesis, these findings have failed to generalize to naturalistic settings. In addition, the language-specificity of any such memory systems is unknown. In this study, we explore a rich set of theory-driven memory costs as predictors of human brain responses (fMRI) to naturalistic story listening, using participant-specific functional localization to identify a language-responsive network and a “multiple demand” network thought to support domain-general working memory. Results show memory costs as postulated by the dependency locality theory, but only in the language network. We argue that working memory is indeed involved in core language comprehension processes, but that the memory resources used are housed in the language system.
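
For readers unfamiliar with the dependency locality theory mentioned in the abstract, below is a minimal illustrative sketch of one common operationalization of DLT integration cost: the cost incurred at a word grows with the number of discourse referents intervening between it and the material it integrates. This is an illustration only, not the predictors used in the study; the token structure and the example parse are assumptions for demonstration.

```python
# Illustrative sketch (not the study's code): a simplified DLT-style
# integration cost. Each token records its dependency head and whether it
# introduces a discourse referent (roughly, nouns and verbs).
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    text: str
    head: Optional[int]   # index of the dependency head; None for the root
    is_referent: bool     # does this token introduce a discourse referent?


def integration_costs(tokens: List[Token]) -> List[int]:
    """Cost at each word: for every dependency completed at that word (the
    rightmost member of the head-dependent pair), count the discourse
    referents intervening between the two words; referential words also pay
    a unit cost for introducing a new referent. A simplified operationalization."""
    costs = [0] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok.head is None:
            continue
        lo, hi = sorted((i, tok.head))
        costs[hi] += sum(t.is_referent for t in tokens[lo + 1:hi])
    for i, tok in enumerate(tokens):
        if tok.is_referent:
            costs[i] += 1
    return costs


# Object-relative clause example: "The reporter who the senator attacked admitted the error."
sent = [
    Token("The", 1, False), Token("reporter", 6, True), Token("who", 5, False),
    Token("the", 4, False), Token("senator", 5, True), Token("attacked", 1, True),
    Token("admitted", None, True), Token("the", 8, False), Token("error", 6, True),
]
print(integration_costs(sent))  # cost peaks at "attacked" and "admitted"
```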

3/23: Pranav on stylometry for darknet migrant identification

Title: Stylometry with Structure and Multitask Learning: Implications for Darknet Forum Migrant Analysis

Abstract: Vendors’ trustworthiness on darknet markets is associated with an anonymous identity. Both buyers and vendors, especially influential ones, tend to migrate to new markets when a previously used market shuts down. A better understanding of the signaling strategies darknet market vendors use to establish trust in their products requires linking users’ identities as they migrate across darknet forums. We develop a stylometry-based multitask learning approach for natural language and interaction modeling, using graph embeddings to construct low-dimensional representations of short episodes of user activity for authorship attribution. We provide a comprehensive evaluation of our methods across four different darknet forums, demonstrating their efficacy over the state of the art, with a lift of up to 2.5x on Mean Retrieval Rank and 2x on Recall@10.
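
As a rough illustration of the multitask recipe described above (not the authors' implementation), the sketch below fuses precomputed stylometric text features with a graph embedding of a user's interactions and shares the representation between authorship attribution and a hypothetical auxiliary task; all dimensions and the auxiliary label set are assumptions.

```python
# Illustrative sketch: multitask stylometry that fuses text features with a
# graph embedding of user interactions and shares a representation between
# the main authorship task and an auxiliary task.
import torch
import torch.nn as nn


class MultitaskStylometry(nn.Module):
    def __init__(self, text_dim, graph_dim, hidden_dim, n_authors, n_aux_labels):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(text_dim + graph_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.3),
        )
        self.author_head = nn.Linear(hidden_dim, n_authors)   # main task
        self.aux_head = nn.Linear(hidden_dim, n_aux_labels)   # auxiliary task

    def forward(self, text_feats, graph_emb):
        h = self.shared(torch.cat([text_feats, graph_emb], dim=-1))
        return self.author_head(h), self.aux_head(h)


def multitask_loss(author_logits, aux_logits, author_y, aux_y, aux_weight=0.3):
    """Weighted sum of the main and auxiliary cross-entropy losses."""
    ce = nn.functional.cross_entropy
    return ce(author_logits, author_y) + aux_weight * ce(aux_logits, aux_y)
```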

3/2: Willy leads discussion on the Arrau corpus

Annotating a broad range of anaphoric phenomena, in a variety of genres: the ARRAU Corpus

Olga Uryupina, Ron Artstein, Antonella Bristot, Federica Cavicchio, Francesca Delogu, Kepa J. Rodriguez, Massimo Poesio

This paper presents the second release of ARRAU, a multi-genre corpus of anaphoric information created over ten years to provide data for the next generation of coreference/anaphora resolution systems combining different types of linguistic and world knowledge with advanced discourse modeling supporting rich linguistic annotations. The distinguishing features of ARRAU include: treating all NPs as markables, including non-referring NPs, and annotating their (non-)referentiality status; distinguishing between several categories of non-referentiality and
annotating non-anaphoric mentions; thorough annotation of markable boundaries (minimal/maximal spans, discontinuous markables); annotating a variety of mention attributes, ranging from morphosyntactic parameters to semantic category; annotating the genericity status of mentions; annotating a wide range of anaphoric relations, including bridging relations and discourse deixis; and, finally, annotating anaphoric ambiguity. The current version of the dataset contains 350K tokens and is publicly available from LDC. In this paper, we discuss in detail all the distinguishing features of the corpus, so far only partially presented in a number of conference and workshop papers; and we discuss the development between the first release of ARRAU in 2008 and this second one.

2/16: Ahmad Aljanaideh leads discussion of “Context in informational bias detection”

Context in Informational Bias Detection

Esther van den Berg, Katja Markert

Informational bias is bias conveyed through sentences or clauses that provide tangential, speculative or background information that can sway readers’ opinions towards entities. By nature, informational bias is context-dependent, but previous work on informational bias detection has not explored the role of context beyond the sentence. In this paper, we explore four kinds of context for informational bias in English news articles: neighboring sentences, the full article, articles on the same event from other news publishers, and articles from the same domain (but potentially different events). We find that integrating event context improves classification performance over a very strong baseline. In addition, we perform the first error analysis of models on this task. We find that the best-performing context-inclusive model outperforms the baseline on longer sentences and on sentences from politically centrist articles.

2/9: Sara Court leads discussion on Moeller et al “Improving Low-Resource Morphological Learning with Intermediate Forms from Finite State Transducers”

https://journals.colorado.edu/index.php/computel/article/view/427

Neural encoder-decoder models are usually applied to morphology learning as an end-to-end process, without considering the underlying phonological representations that linguists posit as abstract forms before morphophonological rules are applied. Finite State Transducers for morphology, on the other hand, are developed to contain these underlying forms as an intermediate representation. This paper shows that training a bidirectional two-step encoder-decoder model of Arapaho verbs to learn two separate mappings (from tags to abstract morphemes, and from abstract morphemes to surface allomorphs) improves results when training data is limited to 10,000 to 30,000 examples of inflected word forms.
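
To make the two-step setup concrete, here is a minimal sketch of the key data move: each (tags, surface form) example is split into two training pairs by way of the FST's intermediate underlying form. The Arapaho-style strings below are hypothetical placeholders, not attested forms from the paper.

```python
# Illustrative sketch (not the paper's code): turning a single (tags -> surface)
# example into two training pairs via an FST-provided underlying form, so the
# model learns tags -> underlying morphemes and underlying -> surface allomorphs
# as separate mappings.

def make_two_step_pairs(tags, underlying, surface):
    """Return the two training pairs used by a two-step encoder-decoder:
    step 1 maps morphological tags to the abstract (underlying) morpheme
    sequence; step 2 maps that sequence to the surface form produced after
    morphophonological rules apply."""
    step1 = (tags, underlying)       # lemma+tags -> abstract morphemes
    step2 = (underlying, surface)    # abstract morphemes -> inflected form
    return step1, step2


# Hypothetical example strings; in practice the FST supplies the intermediate form.
tags = "see[VERB][3SG.SUBJ]"
underlying = "nii-noohow-oot"        # hypothetical abstract segmentation
surface = "niinoohowoot"             # hypothetical surface realization
print(make_two_step_pairs(tags, underlying, surface))
```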

1/26: Ash Lewis leads discussion of “Revisiting Self-Training for Neural Sequence Generation”

Revisiting Self-Training for Neural Sequence Generation

Junxian He, Jiatao Gu, Jiajun Shen, Marc’Aurelio Ranzato

Self-training is one of the earliest and simplest semi-supervised methods. The key idea is to augment the original labeled dataset with unlabeled data paired with the model’s prediction (i.e. the pseudo-parallel data). While self-training has been extensively studied on classification problems, in complex sequence generation tasks (e.g. machine translation) it is still unclear how self-training works due to the compositionality of the target space. In this work, we first empirically show that self-training is able to decently improve the supervised baseline on neural sequence generation tasks. Through careful examination of the performance gains, we find that the perturbation on the hidden states (i.e. dropout) is critical for self-training to benefit from the pseudo-parallel data, which acts as a regularizer and forces the model to yield close predictions for similar unlabeled inputs. Such effect helps the model correct some incorrect predictions on unlabeled data. To further encourage this mechanism, we propose to inject noise to the input space, resulting in a noisy version of self-training. Empirical study on standard machine translation and text summarization benchmarks shows that noisy self-training is able to effectively utilize unlabeled data and improve the performance of the supervised baseline by a large margin.
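
Since the abstract centers on the training loop rather than any particular model, here is a minimal sketch of noisy self-training with the model-specific pieces abstracted away: train_fn and decode_fn are placeholder callables, and word dropout stands in for the input-space noise. This is an illustration of the idea, not the authors' code.

```python
# Illustrative sketch of noisy self-training for sequence generation.
import random
from typing import Callable, List, Tuple


def word_dropout(tokens: List[str], p: float = 0.1) -> List[str]:
    """Inject noise into the input space by randomly dropping tokens."""
    kept = [t for t in tokens if random.random() > p]
    return kept if kept else tokens  # never return an empty input


def noisy_self_training(
    labeled: List[Tuple[List[str], List[str]]],
    unlabeled: List[List[str]],
    train_fn: Callable,    # (pairs) -> model; placeholder for a seq2seq trainer
    decode_fn: Callable,   # (model, source) -> predicted target sequence
    rounds: int = 3,
    drop_p: float = 0.1,
):
    model = train_fn(labeled)  # 1. train a base model on the real parallel data
    for _ in range(rounds):
        # 2. pseudo-label clean unlabeled sources, then noise the source side
        pseudo = [(word_dropout(src, drop_p), decode_fn(model, src))
                  for src in unlabeled]
        # 3. retrain on real + noised pseudo-parallel data
        model = train_fn(labeled + pseudo)
    return model
```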

Clippers 10/27: Evan Jaffe on coreference

Evan will present his work on coreference, including a practice talk for his recently accepted COLING paper, as well as newer neural-network additions to the model and some new results.

Abstract:
Models of human sentence processing effort tend to focus on costs associated with retrieving structures and discourse referents from memory (memory-based) and/or on costs associated with anticipating upcoming words and structures based on contextual cues (expectation-based) (Levy, 2008). Although evidence suggests that expectation and memory may play separable roles in language comprehension (Levy et al., 2013), theories of coreference processing have largely focused on memory: how comprehenders identify likely referents of linguistic expressions. In this study, we hypothesize that coreference tracking also informs human expectations about upcoming words, and we test this hypothesis by evaluating the degree to which incremental surprisal measures generated by a novel coreference-aware semantic parser explain human response times in a naturalistic self-paced reading experiment. Results indicate (1) that coreference information indeed guides human expectations and (2) that coreference effects on memory retrieval may exist independently of coreference effects on expectations. Together, these findings suggest that the language processing system exploits coreference information both to retrieve referents from memory and to anticipate upcoming material.
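
As a sketch of the evaluation logic described above (not the study's actual analysis, which would typically also include spillover predictors and mixed effects), the snippet below runs a likelihood-ratio test asking whether coreference-aware surprisal improves a regression on self-paced reading times beyond a set of baseline predictors; all argument names are illustrative.

```python
# Illustrative sketch: does adding coreference-aware surprisal improve a
# fit to reading times over a baseline regression?
import numpy as np
import statsmodels.api as sm
from scipy import stats


def surprisal_lrt(rt, baseline_predictors, coref_surprisal):
    """rt: (n,) reading times; baseline_predictors: (n, k) controls such as
    word length, frequency, and non-coreference surprisal;
    coref_surprisal: (n,) surprisal from the coreference-aware parser."""
    X0 = sm.add_constant(baseline_predictors)
    X1 = np.column_stack([X0, coref_surprisal])
    fit0 = sm.OLS(rt, X0).fit()
    fit1 = sm.OLS(rt, X1).fit()
    lr = 2 * (fit1.llf - fit0.llf)   # likelihood-ratio statistic
    p = stats.chi2.sf(lr, df=1)      # one added parameter
    return lr, p
```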

Clippers 10/13: Christian leads discussion of Cynical selection of LM training data

Cynical Selection of Language Model Training Data

The Moore-Lewis method of “intelligent selection of language model training data” is very effective, cheap, efficient… and also has structural problems.
(1) The method defines relevance by playing language models trained on the in-domain and the out-of-domain (or data pool) corpora against each other. This powerful idea – which we set out to preserve – treats the two corpora as the opposing ends of a single spectrum. This lack of nuance does not allow for the two corpora to be very similar. In the extreme case where they come from the same distribution, all of the sentences have a Moore-Lewis score of zero, so there is no resulting ranking (a minimal sketch of this score follows the abstract).
(2) The selected sentences are not guaranteed to be able to model the in-domain data, nor even to cover the in-domain data. They are simply well-liked by the in-domain model; this is necessary, but not sufficient.
(3) There is no way to tell what the optimal number of sentences to select is, short of picking various thresholds and building the systems.
We present “cynical selection of training data”: a greedy, lazy, approximate, and generally efficient method of accomplishing the same goal. It has the following properties:
(1) It is responsive to the extent to which two corpora differ.
(2) It quickly reaches near-optimal vocabulary coverage.
(3) It takes into account what has already been selected.
(4) It does not involve defining any kind of domain, nor any kind of classifier.
(5) It has real units.
(6) It knows approximately when to stop.
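
The abstract does not spell out the cynical-selection algorithm itself, but the Moore-Lewis score it critiques in points (1)-(3) of the first list is easy to illustrate. Below is a minimal sketch that uses add-one-smoothed unigram LMs purely to stay self-contained (the original method uses n-gram models); sentences are ranked by the in-domain minus data-pool cross-entropy difference, and the degenerate zero-score case from point (1) falls out directly.

```python
# Illustrative sketch of the Moore-Lewis baseline: rank candidate sentences by
# H_in(s) - H_out(s), the cross-entropy difference between an in-domain LM and
# a data-pool LM (lower = more in-domain-like).
import math
from collections import Counter
from typing import List


class UnigramLM:
    """Add-one-smoothed unigram LM, just enough for the sketch."""
    def __init__(self, corpus: List[List[str]]):
        self.counts = Counter(tok for sent in corpus for tok in sent)
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1  # +1 bucket for unseen tokens

    def cross_entropy(self, sent: List[str]) -> float:
        """Average negative log-probability per token."""
        lp = sum(math.log((self.counts.get(t, 0) + 1) / (self.total + self.vocab))
                 for t in sent)
        return -lp / max(len(sent), 1)


def moore_lewis_rank(pool: List[List[str]],
                     in_domain: List[List[str]]) -> List[List[str]]:
    lm_in, lm_out = UnigramLM(in_domain), UnigramLM(pool)
    # If pool and in-domain data come from the same distribution, the two
    # cross-entropies coincide and every score is ~zero: no useful ranking.
    return sorted(pool, key=lambda s: lm_in.cross_entropy(s) - lm_out.cross_entropy(s))
```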

Clippers 10/6: Jeniya Tabassum leads discussion of BLEURT

BLEURT: Learning Robust Metrics for Text Generation
Thibault Sellam, Dipanjan Das, Ankur P. Parikh

Text generation has made significant advances in the last few years. Yet, evaluation metrics have lagged behind, as the most popular choices (e.g., BLEU and ROUGE) may correlate poorly with human judgments. We propose BLEURT, a learned evaluation metric based on BERT that can model human judgments with a few thousand possibly biased training examples. A key aspect of our approach is a novel pre-training scheme that uses millions of synthetic examples to help the model generalize. BLEURT provides state-of-the-art results on the last three years of the WMT Metrics shared task and the WebNLG Competition dataset. In contrast to a vanilla BERT-based approach, it yields superior results even when the training data is scarce and out-of-distribution.
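
For concreteness, here is an illustrative sketch of the general recipe BLEURT instantiates: a BERT-style encoder fine-tuned as a single-output regressor over (reference, candidate) pairs that predicts a human quality rating. It uses Hugging Face Transformers for illustration and is not the released BLEURT code or checkpoints; the model name and example strings are assumptions.

```python
# Illustrative sketch of a BERT-based learned metric: a single-output regressor
# over (reference, candidate) pairs. Not the released BLEURT implementation.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "bert-base-uncased"  # placeholder; BLEURT starts from BERT checkpoints

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=1, problem_type="regression")


def rate(reference: str, candidate: str) -> float:
    """Predict a scalar quality score for a candidate against its reference."""
    enc = tokenizer(reference, candidate, return_tensors="pt",
                    truncation=True, padding=True)
    with torch.no_grad():
        return model(**enc).logits.squeeze().item()


# Before fine-tuning on human ratings (and, as in the paper, pre-training on
# millions of synthetic example pairs), the predicted score is uncalibrated.
print(rate("The cat sat on the mat.", "A cat was sitting on the mat."))
```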