Clippers 4/16: Byung-Doh Oh on the bigger-is-worse effects of model size and training data of large language model surprisal on human reading times

The bigger-is-worse effects of model size and training data of large language model surprisal on human reading times

(Saarland University colloquium practice talk)

Surprisal estimates from Transformer-based large language models (LLMs) are often used to model expectation-based effects in human sentence processing, which are facilitations in processing driven by the predictability of each upcoming word. This talk presents a series of analyses showing that surprisal estimates from LLM variants that are bigger and are trained on more data are worse predictors of processing difficulty that manifests in human reading times. First, regression analyses show a strong inverse correlation between model size and fit to reading times across three LLM families on two separate datasets. An error analysis reveals a systematic deviation for the larger variants, such as underpredicting reading times of named entities and making compensatory overpredictions for reading times of function words. Subsequently, LLM variants that vary in the amount of training data show that their surprisal estimates generally provide the best fit after seeing about two billion training tokens and begin to diverge with more training data. The adverse influence of model size also begins to emerge at this point and becomes stronger as training continues. Finally, based on recent findings on the scaling behavior of LLMs, word frequency is presented as a unified explanation for these two effects. The theoretical implications of these results will be discussed.
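
For readers unfamiliar with the quantity discussed above: the surprisal of a word is its negative log probability given the preceding context, so less predictable words receive higher surprisal. The sketch below is only an illustration under stated assumptions. It uses a toy add-one-smoothed bigram model as a stand-in for an LLM, and the corpus and the function names `train_bigram` and `surprisal` are hypothetical, not taken from the talk.

```python
import math
from collections import Counter

def train_bigram(tokens):
    """Count unigrams and bigrams from a token list (toy stand-in for an LLM)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def surprisal(prev, word, unigrams, bigrams):
    """Surprisal in bits: -log2 P(word | prev), using add-one smoothing."""
    vocab_size = len(unigrams)
    prob = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
    return -math.log2(prob)

# Hypothetical toy corpus, for illustration only.
corpus = "the dog chased the cat and the dog barked".split()
unigrams, bigrams = train_bigram(corpus)

# A frequently seen continuation gets lower surprisal than an unseen one.
print(surprisal("the", "dog", unigrams, bigrams))
print(surprisal("the", "barked", unigrams, bigrams))
```

In reading-time studies of this kind, per-word surprisal values from a language model are typically entered as predictors in a regression on reading times; the bigram model here only stands in for that model.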

Clippers 4/9: Christian Clark, Midwest Speech and Language Days practice talk

Grammar induction, the task of learning a set of syntactic rules from minimally annotated training data, can provide evidence about the mechanisms underlying children’s language acquisition. Recent work has achieved advances in the induction of probabilistic context-free grammars (PCFGs). However, less attention has been paid to inducing categorial grammars, despite their appealing properties such as a transparent syntax–semantics interface. Motivated by this, we introduce a new model for inducing a basic categorial grammar. The model attains comparable accuracy to state-of-the-art PCFG systems and learns from raw data without part-of-speech information, in contrast to earlier categorial grammar induction systems.

Clippers 4/2: Sara Court on Leveraging LLMs for Low-Resource Translation

This work investigates the in-context learning abilities of LLM foundation models when instructed to translate text from a low-resource language into a high-resource language as part of an automated machine translation pipeline. As case studies, I conduct a set of experiments using two language pairs, Inuktitut-English and Quechua-Spanish, and examine the informativity of various types of lexical and grammatical information retrieved from a constrained database of pedagogical materials (dictionaries and grammar lessons) as well as sentence-length examples retrieved from parallel corpora designed for traditional NLP tasks. Ablation studies that manipulate (1) context type (morpheme definitions, grammar lessons, and corpus examples), (2) retrieval methods (automated vs. manual), and (3) model type (GPT-4, GPT-3.5 Turbo, Llama2, and Gemini) suggest that even relatively small (7B) LLMs are capable of utilizing prompt context for zero-shot translation when provided a minimally sufficient amount of relevant linguistic information. However, the variable effects of database construction, retrieval method, model type, and linguistic structure highlight the limitations of even the best LLMs as standalone translation systems for the majority of the world’s 7,000+ languages and their speakers.

Clippers 3/26: Amad Hussain, A Review of RAPTOR: Can Tree-Organized Retrieval Improve a Virtual Museum Tour Guide?

This week in Clippers (3/26) I will be presenting a review of the paper RAPTOR: Recursive Abstractive Processing For Tree-Organized Retrieval. This work seeks to semantically cluster passages within a corpus and hierarchically create summaries based upon these clusters. A retrieval system may then present the original passages or summaries to a downstream LLM for Retrieval-Augmented Generation (RAG). The authors present SOTA results on question-answering tasks, especially those requiring multi-step reasoning. In this talk, I will review RAPTOR and explore how it, and other related retrieval solutions, can be applied to the existing Virtual Museum Tour Guide project in collaboration with COSI. This will basically be a brainstorming session following a paper review, so I am hoping for good discussion.

Clippers 3/19: Christian Clark on semantically aided categorial grammar induction

Studies of grammar induction are a source of evidence about the mechanisms underlying children’s language acquisition. Manipulating the prior knowledge and inductive biases of grammar inducers can yield insights about the learnability of syntactic structure under various assumptions about the learner. While early induction models often relied on annotated data, more recent models have made progress toward learning from raw data, working with both probabilistic context-free grammars and categorial grammars. Still, the accuracy levels of current systems fall well below those of human learners.

Incorporating world knowledge into grammar inducers is a potential path toward further improvement, one which is well motivated by psycholinguistic theory (e.g. semantic bootstrapping). Along these lines, I will present a categorial grammar inducer that incorporates semantic knowledge — implemented as association weights between predicate roles — into an existing syntax-only inducer. Associations can be distilled from large language models (LLMs), opening up possibilities not only for better grammar induction but also for exploration of the conceptual knowledge acquired by LLMs. This project is still a work in progress, but I will present some preliminary results on synthetic data and broad-coverage corpora.

Clippers 3/5: Alyssa Allen on SQL Query Explainability using Natural Language Generation

SQL Query Explainability using Natural Language Generation

This work is rooted in a larger project aimed at developing a dialogue system that helps increase transparency of database query outputs for non-expert SQL users. Previously, I’ve discussed processes for building a training set using few-shot prompting and a hand-annotated set of commented queries. Additionally, I’ve discussed test set results from LLMs (such as ChatGPT and Llama). This presentation will shift focus to the content of the generated natural language comments.

I’ll discuss the development of comment guidelines and the need for guidelines in standardizing the evaluation. Comment guidelines should ideally provide transparency in what constitutes a “good” comment. Comments should also 1) reflect certain properties of the relational database structure, 2) prioritize semantic fidelity to the query and 3) align with the user language wherever appropriate. The comment guidelines use these core elements to outline how generated natural language can increase explainability of database queries. Our methods will be compared to approaches that leverage templated or rule-based systems of explainability.

Clippers 2/20: Byung-Doh Oh, Frequency Explains the Inverse Correlation of Large Language Models’ Size, Training Data Amount, and Surprisal’s Fit to Reading Times

Frequency Explains the Inverse Correlation of Large Language Models’ Size, Training Data Amount, and Surprisal’s Fit to Reading Times

Recent studies have shown that as Transformer-based language models become larger and are trained on very large amounts of data, the fit of their surprisal estimates to naturalistic human reading times degrades. The current work presents a series of analyses showing that word frequency is a key explanatory factor underlying these two trends. First, residual errors from four language model families on four corpora show that the inverse correlation between model size and fit to reading times is the strongest on the subset of least frequent words, which is driven by excessively accurate predictions of larger model variants. Additionally, training dynamics reveal that during later training steps, all model variants learn to predict rare words and that larger model variants do so more accurately, which explains the detrimental effect of both training data amount and model size on fit to reading times. Finally, a feature attribution analysis demonstrates that larger model variants are able to accurately predict rare words based on both an effectively longer context window and stronger local associations compared to smaller model variants. Taken together, these results indicate that Transformer-based language models’ surprisal estimates diverge from human-like expectations due to the superhumanly complex associations they learn for predicting rare words.

Clippers 2/6: Ash Lewis on a user study of interactive KB querying

In Clippers on Tuesday, February 6th, I will be presenting the results of a user study we (Lingbo Mo, Huan Sun, Mike White, and I) conducted in order to test the viability of an interactive semantic parsing system we built. The system was designed to help users query a knowledge base in natural language, offsetting the need to know the query language that the knowledge base uses and thus making the information more accessible to novice users. Our system decomposes the query into pieces and translates them into understandable natural language, so that users can see exactly how the system reached an answer and therefore be confident in it. Alternatively, if the parse is incorrect, the user can use a natural language interface to correct it.

This work was conducted in the “pre-LLM era” and thus much of the technical contribution is a bit outdated. However, the user study, in which we had crowdworkers test several versions of the system, has broad application to human evaluation of dialogue systems. As dialogue systems become increasingly ubiquitous, we believe our experience conducting this user study has important lessons to contribute to evaluation methodologies.

My goal for Clippers is to make clearer the “story” for a paper about evaluation – this project has spanned many years and there is a great deal of content to sift through. I hope to get fresh eyes on that content and get feedback on the most salient pieces.

Clippers 1/30: Chris Brew on building a summarizer module for Lexis+AI

Building a summarizer module for Lexis+AI

With minimal prompting, commercial large language models can produce useful indicative summaries of many documents. Given informed and tolerant readers, the bar for usefulness is low, and current models easily achieve it. But these summaries do not meet the standards required of a professional information product. We show that, for legal documents, a “faceted” approach to summarization can smooth the path to acceptable professional quality. The Lexis+AI product currently covers about three and a half use cases, which I will explain and demonstrate.

In an applied AI setting, and especially for LLMs, evaluation is a key issue, and one which plays out differently for each use case, and also differently from what is normal in academic NLP. If time permits, I will try to give my impressions of how this really works in practice, and point at opportunities for high-impact work on evaluation.

In other words, we’ll finish up talking a little about what “acceptable professional quality” might mean. I am definitely speaking for myself on this, not representing a company position.

Clippers 1/23: Sara Court and Alyssa Allen, Project Workshopping/Brainstorming

Sara will be workshopping developments for her QP2 on leveraging pedagogical materials with LLMs for low-resource machine translation.

Alyssa will be workshopping directions for a potential collaborative project related to human-machine interactions. The experiments will involve an embodied language-capable robot. Research questions will likely focus on how the robot can best align with human conversational preferences. Example linguistic/conversational features of interest include backchanneling, laughter, cooperative overlap, and rate of speech.