Evan will present his work on coreference, including a practice talk for his recent COLING accepted paper as well as newer neural net additions to the model and some new results.
Models of human sentence processing effort tend to focus on costs associated with retrieving structures and discourse referents from memory (memory-based) and/or on costs associated with anticipating upcoming words and structures based on contextual cues (expectation-based) (Levy, 2008). Although evidence suggests that expectation and memory may play separable roles in language comprehension (Levy et al., 2013), theories of coreference processing have largely focused on memory: how comprehenders identify likely referents of linguistic expressions. In this study, we hypothesize that coreference tracking also informs human expectations about upcoming words, and we test this hypothesis by evaluating the degree to which incremental surprisal measures generated by a novel coreference-aware semantic parser explain human response times in a naturalistic self-paced reading experiment. Results indicate (1) that coreference information indeed guides human expectations and (2) that coreference effects on memory retrieval may exist independently of coreference effects on expectations. Together, these findings suggest that the language processing system exploits coreference information both to retrieve referents from memory and to anticipate upcoming material.
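Surprisal (Levy, 2008) quantifies expectation-based cost as the negative log probability of a word given its preceding context. A minimal sketch of the measure; the probabilities below are hypothetical illustrations, not output of the paper's coreference-aware parser:

```python
import math

def surprisal(p_word_given_context: float) -> float:
    """Surprisal in bits: -log2 P(word | context)."""
    return -math.log2(p_word_given_context)

# Hypothetical conditional probabilities for the word "she" after
# "Mary lost her keys, so ..." under two models:
p_with_coref = 0.20     # a coreference-aware model concentrates mass on the referent-linked pronoun
p_without_coref = 0.05  # a coreference-blind model spreads probability more thinly

print(surprisal(p_with_coref))     # lower surprisal: predicted faster reading times
print(surprisal(p_without_coref))  # higher surprisal: predicted slower reading times
```

Under the surprisal hypothesis, per-word reading times in self-paced reading should increase with this quantity, which is how the parser's estimates are linked to the behavioral data.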
Jeniya will present her recently-completed W-NUT shared task on analysis of wet lab protocols. The shared task summary paper draft will be sent to the email list; contact Jeniya if you need a copy.
CYNICAL SELECTION OF LANGUAGE MODEL TRAINING DATA
The Moore-Lewis method of “intelligent selection of language model training
data” is very effective, cheap, efficient… and also has structural problems.
(1) The method defines relevance by playing language models trained on the in-domain
and the out-of-domain (or data pool) corpora against each other. This powerful
idea – which we set out to preserve – treats the two corpora as the opposing ends
of a single spectrum. This lack of nuance does not allow for the two corpora to be
very similar. In the extreme case where they come from the same distribution, all of
the sentences have a Moore-Lewis score of zero, so there is no resulting ranking.
(2) The selected sentences are not guaranteed to be able to model the in-domain data,
nor to even cover the in-domain data. They are simply well-liked by the in-domain
model; this is necessary, but not sufficient.
(3) There is no way to tell the optimal number of sentences to select, short of
trying various thresholds and building a system for each.
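For reference, the Moore-Lewis score discussed above is the cross-entropy difference between in-domain and data-pool language models, with lower scores indicating more in-domain-like sentences. A minimal sketch using add-one-smoothed unigram models (real implementations use higher-order LMs; the toy corpora here are illustrative):

```python
import math
from collections import Counter

def unigram_lm(corpus):
    """Maximum-likelihood unigram model with add-one smoothing."""
    counts = Counter(w for sent in corpus for w in sent.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves mass for unseen words
    return lambda w: (counts[w] + 1) / (total + vocab)

def moore_lewis_score(sentence, p_in, p_out):
    """Per-word cross-entropy difference H_in(s) - H_out(s).
    Lower scores mean the sentence looks more in-domain."""
    words = sentence.split()
    h_in = -sum(math.log(p_in(w)) for w in words) / len(words)
    h_out = -sum(math.log(p_out(w)) for w in words) / len(words)
    return h_in - h_out

in_domain = ["the patient received treatment", "the doctor saw the patient"]
pool = ["the court ruled today", "the patient received treatment",
        "stocks fell sharply today"]

p_in, p_out = unigram_lm(in_domain), unigram_lm(pool)
ranked = sorted(pool, key=lambda s: moore_lewis_score(s, p_in, p_out))
# Note problem (1): if in_domain and pool had identical distributions,
# every score would be zero and the ranking vacuous.
```

The sketch also makes problems (2) and (3) concrete: ranking says nothing about coverage of the in-domain data, and nothing in the scores themselves indicates where to cut the ranked list.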
We present “cynical selection of training data”: a greedy, lazy, approximate, and generally
efficient method of accomplishing the same goal. It has the following properties:
(1) Is responsive to the extent to which two corpora differ.
(2) Quickly reaches near-optimal vocabulary coverage.
(3) Takes into account what has already been selected.
(4) Does not involve defining any kind of domain, nor any kind of classifier.
(5) Has real units.
(6) Knows approximately when to stop.
BLEURT: Learning Robust Metrics for Text Generation
Thibault Sellam Dipanjan Das Ankur P. Parikh
Text generation has made significant advances in the last few years. Yet, evaluation metrics have lagged behind, as the most popular choices (e.g., BLEU and ROUGE) may correlate poorly with human judgments. We propose BLEURT, a learned evaluation metric based on BERT that can model human judgments with a few thousand possibly biased training examples. A key aspect of our approach is a novel pre-training scheme that
uses millions of synthetic examples to help the model generalize. BLEURT provides state-of-the-art results on the last three years of the WMT Metrics shared task and the WebNLG Competition dataset. In contrast to a vanilla BERT-based approach, it yields superior results even when the training data is scarce and out-of-distribution.
Acquiring language from speech by learning to remember and predict
Classical accounts of child language learning invoke memory limits as a pressure to discover sparse, language-like representations of speech, while more recent proposals stress the importance of prediction for language learning. In this talk, I will describe a broad-coverage unsupervised neural network model to test memory and prediction as sources of signal by which children might acquire language directly from the perceptual stream. The model embodies several likely properties of real-time human cognition: it is strictly incremental, it encodes speech into hierarchically organized labeled segments, it allows interactive top-down and bottom-up information flow, it attempts to model its own sequence of latent representations, and its objective function only recruits local signals that are plausibly supported by human working memory capacity. Results show that much phonemic structure is learnable from unlabeled speech on the basis of these local signals. In addition, remembering the past and predicting the future both contribute independently to the linguistic content of acquired representations.
UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training
Hangbo Bao, Li Dong, Furu Wei, Wenhui Wang, Nan Yang, Xiaodong Liu, Yu Wang, Songhao Piao, Jianfeng Gao, Ming Zhou, Hsiao-Wuen Hon
We propose to pre-train a unified language model for both autoencoding and partially autoregressive language modeling tasks using a novel training procedure, referred to as a pseudo-masked language model (PMLM). Given an input text with masked tokens, we rely on conventional masks to learn inter-relations between corrupted tokens and context via autoencoding, and pseudo masks to learn intra-relations between masked spans via partially autoregressive modeling. With well-designed position embeddings and self-attention masks, the context encodings are reused to avoid redundant computation. Moreover, conventional masks used for autoencoding provide global masking information, so that all the position embeddings are accessible in partially autoregressive language modeling. In addition, the two tasks pre-train a unified language model as a bidirectional encoder and a sequence-to-sequence decoder, respectively. Our experiments show that the unified language models pre-trained using PMLM achieve new state-of-the-art results on a wide range of natural language understanding and generation tasks across several widely used benchmarks.
Language (Technology) is Power: A Critical Survey of “Bias” in NLP
Su Lin Blodgett, Solon Barocas, Hal Daumé III, Hanna Wallach
We survey 146 papers analyzing “bias” in NLP systems, finding that their motivations are often vague, inconsistent, and lacking in normative reasoning, despite the fact that analyzing “bias” is an inherently normative process. We further find that these papers’ proposed quantitative techniques for measuring or mitigating “bias” are poorly matched to their motivations and do not engage with the relevant literature outside of NLP. Based on these findings, we describe the beginnings of a path forward by proposing three recommendations that should guide work analyzing “bias” in NLP systems. These recommendations rest on a greater recognition of the relationships between language and social hierarchies, encouraging researchers and practitioners to articulate their conceptualizations of “bias”—i.e., what kinds of system behaviors are harmful, in what ways, to whom, and why, as well as the normative reasoning underlying these statements—and to center work around the lived experiences of members of communities affected by NLP systems, while interrogating and reimagining the power relations between technologists and such communities.
Given raw (in our case, textual) sentences as input, the Paradigm Discovery Problem (PDP) (Elsner et al., 2019, Erdmann et al., 2020) involves a bi-directional clustering of words into paradigms and cells. For instance, solving the PDP requires one to determine that ring and rang belong to the same paradigm, while bring and bang do not, and that rang and banged belong to the same cell, as they realize the same morphosyntactic property set, i.e., past tense. Solving the PDP is necessary in order to bootstrap to solving what’s often referred to as the Paradigm Cell Filling Problem (PCFP) (Ackerman et al., 2009), i.e., predicting forms that fill yet unrealized cells in partially attested paradigms. That is to say, if I want the plural of thesis, but have only seen the singular, I can only predict theses if I’ve solved the PDP in such a way that allows me to make generalizations regarding how number is realized.
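The bi-directional clustering can be pictured as a sparse paradigm-by-cell grid. A toy sketch of that data structure, using the examples from the paragraph above (the cell labels and grid layout are illustrative, not the system's actual representation):

```python
# A solved PDP instance is, conceptually, a partial grid: rows are
# paradigms (lexemes), columns are cells (morphosyntactic property sets).
paradigms = {
    "ring":   {"PRESENT": "ring", "PAST": "rang"},
    "bang":   {"PRESENT": "bang", "PAST": "banged"},
    "thesis": {"SINGULAR": "thesis"},  # PLURAL cell unrealized in the input
}

def cellmates(cell):
    """All attested forms realizing the same cell, e.g. all past tenses."""
    return [forms[cell] for forms in paradigms.values() if cell in forms]

# rang and banged belong to the same cell, though not the same paradigm:
print(cellmates("PAST"))

# The PCFP is then: predict the form filling an empty (paradigm, cell)
# slot, e.g. ("thesis", "PLURAL"), by generalizing from filled cells.
```

Solving the PDP amounts to inducing both the row and column structure of this grid from raw text; the PCFP presupposes that structure and fills in the gaps.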
Two forthcoming works address constrained versions of the PDP by focusing on a single part of speech at a time (Erdmann et al., 2020; Kann et al., 2020). For my dissertation, I am trying to adapt the system of Erdmann et al. (2020) to handle the unconstrained PDP by addressing scalability and overfitting issues which lock the system into poor predictions regarding the size of paradigms and prematurely eliminate potentially rewarding regions of the search space. This will be a very informal talk; I'm just looking to get some feedback on some issues I keep running into.
High frequency marker categories in grammar induction
High frequency marker words have been shown to be crucial in first language acquisition, where they provide reliable clues for speech segmentation and grammatical categorization of words. Recent work in model selection of grammar induction has also hinted at a similar role played by high frequency marker words in distributionally inducing grammars. In this work, we first expand the notion of high frequency marker words to high frequency marker categories to include languages where grammatical relations between words are expressed by morphology, not word order. Through analysis of data from previous work and experiments with novel induction models, this work shows that high frequency marker categories are the main driver of accurate grammar induction.
Title: An unsupervised discrete-state sequence model of human language acquisition from speech
Abstract: I will present a progress report on an ongoing attempt to apply discrete-state multi-scale recurrent neural networks as models of child language acquisition from speech. The model is inspired by prior arguments that abstract linguistic representations (e.g. phonemes and words) constrain the acoustic form of natural language utterances, and thus that attempting to efficiently store and anticipate auditory signals may emergently guide child learners to discover underlying linguistic structure. In this study, the artificial learner is a recurrent neural network arranged in interacting layers. Information exchange between adjacent layers is governed by binary detector neurons. When the detector neuron between two layers fires, those layers exchange their current analyses of the input signal in the form of discrete binary codes. Thus, in line with much existing linguistic theory, the model exploits both bottom-up and top-down signals to produce a representation of the input signal that is segmental, discrete, and featural. The learner adapts this behavior in service of four simultaneous unsupervised objectives: reconstructing the past, predicting the future, reconstructing the segment given a label, and reconstructing the label given a segment. Each layer treats the layer below as data, and thus learning is partially driven by attempting to model the learner's own mental state, in line with influential hypotheses from cognitive neuroscience. The model solves a novel task (unsupervised joint segmentation and labeling of phonemes and words from speech), and it is therefore difficult to establish an overall state-of-the-art performance threshold. However, results for the subtask of unsupervised word segmentation currently lag well behind the state of the art.