Clippers 9/27: Micha Elsner on community-centered morphological annotation

Towards community-centered morphological annotation
Micha Elsner

I’ll be discussing joint work with Sara Court, Maria Copot, Noah Diewald, and Stephanie Antetomaso, covering our recent ComputEL publication and a slightly updated version for Language Documentation & Archiving.

I hope to discuss both the existing work (for which an abstract is below) and some of the upcoming challenges as we work to develop the machine learning component into a usable, deployable part of the user experience.

There are many challenges in morphological fieldwork annotation: it relies heavily on segmentation and feature labeling (which have both practical and theoretical drawbacks), it’s time-intensive, and the annotator must be linguistically trained and may still annotate inconsistently. We propose a workflow that relies on unsupervised and active learning grounded in Word-and-Paradigm morphology (WP). Machine learning has the potential to greatly accelerate the annotation process and allow a human annotator to focus on problematic cases, while the WP approach makes for an annotation system that is word-based and relational, removing the need to make decisions about feature labeling and segmentation early in the process and allowing speakers of the language of interest to participate more actively, since linguistic training is not necessary. We present a proof-of-concept for the first step of the workflow: in a realistic fieldwork setting, annotators can process hundreds of forms per hour.
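To make the word-based, relational idea concrete, here is a minimal sketch (not the authors’ system; the function names and the ranking heuristic are hypothetical) of how a tool might surface candidate whole-word relations for an annotator to confirm or reject, grouping word pairs by the surface alternation that relates them rather than asking anyone to commit to morpheme boundaries up front:

```python
from collections import defaultdict

def longest_common_prefix(a, b):
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    return a[:i]

def propose_relations(words, min_stem=3):
    """Toy stand-in for a WP-style annotation aid: group word pairs by the
    suffix alternation relating them, so an annotator can confirm whole-word
    relations (walk ~ walked) instead of labeling segments."""
    patterns = defaultdict(list)
    words = sorted(set(words))
    for i, w1 in enumerate(words):
        for w2 in words[i + 1:]:
            stem = longest_common_prefix(w1, w2)
            if len(stem) >= min_stem:
                alternation = (w1[len(stem):], w2[len(stem):])
                patterns[alternation].append((w1, w2))
    # Rank alternations by how many pairs exhibit them, so frequent,
    # likely-systematic patterns are presented to the annotator first.
    return sorted(patterns.items(), key=lambda kv: -len(kv[1]))

ranked = propose_relations(["walk", "walked", "jump", "jumped", "talk", "talked"])
# Top pattern: ('', 'ed') with pairs (jump, jumped), (talk, talked), (walk, walked)
```

The annotator’s decisions stay at the level of whole-word relations, which is what lets untrained speakers participate; the alternation bookkeeping here is only a crude proxy for the relational analyses the workflow would actually learn.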

Clippers 9/20: Byung-Doh Oh on the larger-gets-worse behavior of G/OPT surprisal

Why Does Surprisal From Larger Transformer-Based Language Models Provide a Poorer Fit to Human Reading Times?

Byung-Doh Oh and William Schuler

This work presents a replication and post-hoc analysis of recent surprising findings that larger GPT-2 language model variants that show lower perplexity nonetheless yield surprisal estimates that are less predictive of human reading times (Oh et al., 2022). First, regression analyses show a strictly monotonic, positive log-linear relationship between perplexity and fit to reading times for five GPT-Neo variants and eight OPT variants on two separate datasets, providing strong empirical support for this trend. Subsequently, analysis of residual errors reveals systematic deviations of the larger variants, such as underprediction of reading times for named entities and overprediction of reading times for nouns that are heavily constrained by the discourse. These results suggest that the propensity of larger Transformer-based models to ‘memorize’ sequences during training makes their surprisal estimates diverge from humanlike expectations, which warrants caution in using pretrained language models to study human language processing.
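The two basic ingredients of this kind of evaluation can be sketched as follows. This is a simplified illustration, not the paper’s pipeline (which fits regression models with additional baseline predictors to corpus reading times): per-token surprisal computed from a model’s probabilities, and a regression of reading times on surprisal whose fit is then compared across model variants.

```python
import numpy as np

def surprisal(probs):
    """Per-token surprisal in bits: -log2 p(w_t | context).
    `probs` would come from a language model; here they are given directly."""
    return -np.log2(np.asarray(probs, dtype=float))

def fit_reading_times(surprisals, reading_times):
    """Ordinary least squares of reading time on surprisal (with intercept).
    Returns (slope, intercept, r_squared); a larger r_squared means the
    model's surprisal estimates better predict human reading times."""
    X = np.column_stack([surprisals, np.ones_like(surprisals)])
    coef, *_ = np.linalg.lstsq(X, reading_times, rcond=None)
    pred = X @ coef
    ss_res = np.sum((reading_times - pred) ** 2)
    ss_tot = np.sum((reading_times - np.mean(reading_times)) ** 2)
    return coef[0], coef[1], 1 - ss_res / ss_tot

# Synthetic example: halving a token's probability adds one bit of surprisal.
s = surprisal([0.5, 0.25, 0.125])          # -> [1.0, 2.0, 3.0] bits
slope, intercept, r2 = fit_reading_times(s, np.array([300.0, 320.0, 340.0]))
```

Repeating the fit with surprisals from differently sized model variants, and comparing the resulting fits against each variant’s perplexity, gives the kind of perplexity-versus-fit curve the abstract describes.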

Clippers 9/13: Sam Stevens on Foundation Model Encryption

We use autoregressive models’ capability to encode token sequences as the basis for a novel symmetric-key cipher. Because any given message has a near-infinite number of possible representations, we aim to demonstrate empirically that the proposed cipher is CPA-secure.
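The abstract does not specify the construction, but the general shape can be illustrated with a toy sketch. Everything here is assumed: a keyed PRNG stands in for the autoregressive model’s key-conditioned next-token ordering, and each plaintext token is replaced by its rank under that ordering. The real proposal would use an actual language model and could sample among the many valid encodings of a message, whereas this stand-in is a simple deterministic keyed substitution over token IDs.

```python
import random

VOCAB_SIZE = 256  # toy byte-level "vocabulary"

def keyed_orderings(key, length):
    """Keyed PRNG stand-in for the model: at each position, yield an
    ordering of the vocabulary playing the role of the (key-conditioned)
    next-token ranking an autoregressive model would produce."""
    rng = random.Random(key)
    for _ in range(length):
        order = list(range(VOCAB_SIZE))
        rng.shuffle(order)
        yield order

def encrypt(key, plaintext):
    """Replace each byte with its rank in that position's ordering."""
    return bytes(order.index(b)
                 for order, b in zip(keyed_orderings(key, len(plaintext)), plaintext))

def decrypt(key, ciphertext):
    """Invert the rank lookup using the same key-derived orderings."""
    return bytes(order[r]
                 for order, r in zip(keyed_orderings(key, len(ciphertext)), ciphertext))

ct = encrypt("shared secret", b"attack at dawn")
assert decrypt("shared secret", ct) == b"attack at dawn"
```

Note that this toy picks a single encoding per message and makes no security claim; the “near-infinite representations” property the abstract appeals to would come from the language model admitting many distinct token sequences for the same underlying message.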