Clippers 4/16: Byung-Doh Oh on the bigger-is-worse effects of model size and training data of large language model surprisal on human reading times

The bigger-is-worse effects of model size and training data of large language model surprisal on human reading times

(Saarland University colloquium practice talk)

Surprisal estimates from Transformer-based large language models (LLMs) are often used to model expectation-based effects in human sentence processing, i.e., the facilitation in processing driven by the predictability of each upcoming word. This talk presents a series of analyses showing that surprisal estimates from LLM variants that are larger and trained on more data are worse predictors of the processing difficulty that manifests in human reading times. First, regression analyses show a strong inverse correlation between model size and fit to reading times across three LLM families on two separate datasets. An error analysis reveals systematic deviations in the larger variants, such as underpredicted reading times for named entities and compensatory overpredictions for function words. Subsequently, analyses of LLM variants trained on different amounts of data show that their surprisal estimates generally provide the best fit to reading times after about two billion training tokens and begin to diverge with further training. The adverse effect of model size also emerges around this point and strengthens as training continues. Finally, drawing on recent findings on the scaling behavior of LLMs, word frequency is presented as a unified explanation for both effects. The theoretical implications of these results will be discussed.
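The expectation-based effects above hinge on surprisal, the negative log probability of a word given its preceding context. As a minimal, self-contained illustration of the quantity itself, here is a sketch using a toy add-one-smoothed bigram model as a stand-in for an actual LLM (which would supply the conditional probabilities instead); the corpus and function names are illustrative, not from the talk.

```python
import math
from collections import Counter

def train_bigram(corpus):
    """Count unigrams and bigrams over sentence-initial-padded token lists."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    return unigrams, bigrams

def surprisal(prev, word, unigrams, bigrams):
    """Surprisal in bits: -log2 P(word | prev), with add-one smoothing."""
    vocab = len(unigrams)
    p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab)
    return -math.log2(p)

corpus = [["the", "dog", "barks"], ["the", "dog", "runs"], ["the", "cat", "barks"]]
uni, bi = train_bigram(corpus)
# the more frequent, hence more predictable, continuation carries lower surprisal
print(surprisal("the", "dog", uni, bi))  # lower
print(surprisal("the", "cat", uni, bi))  # higher
```

Regressing reading times on per-word surprisal values like these (alongside baseline predictors) is the standard way to measure how well a model's expectations fit human processing.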

Clippers 4/9: Christian Clark, Midwest Speech and Language Days practice talk

Grammar induction, the task of learning a set of syntactic rules from minimally annotated training data, can provide evidence about the mechanisms underlying children’s language acquisition. Recent work has made advances in the induction of probabilistic context-free grammars (PCFGs). However, less attention has been paid to inducing categorial grammars, despite appealing properties such as a transparent syntax–semantics interface. Motivated by this, we introduce a new model for inducing a basic categorial grammar. The model attains accuracy comparable to state-of-the-art PCFG systems and, in contrast to earlier categorial grammar induction systems, learns from raw text without part-of-speech information.
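A basic categorial grammar combines categories with just two rules, forward and backward application, which is part of what makes its syntax–semantics interface so transparent. A minimal sketch of those rules (the tuple encoding of slash categories is my own illustrative choice, not the paper’s):

```python
# A category is either an atomic string ("S", "NP", "N") or a tuple
# (slash, result, argument): "/" seeks its argument to the right,
# "\\" seeks its argument to the left.

def forward_apply(fn, arg):
    """Forward application: X/Y combined with a following Y yields X."""
    if isinstance(fn, tuple) and fn[0] == "/" and fn[2] == arg:
        return fn[1]
    return None  # categories do not combine

def backward_apply(arg, fn):
    """Backward application: Y followed by X\\Y yields X."""
    if isinstance(fn, tuple) and fn[0] == "\\" and fn[2] == arg:
        return fn[1]
    return None

det = ("/", "NP", "N")    # determiner: NP/N
verb = ("\\", "S", "NP")  # intransitive verb: S\NP
np = forward_apply(det, "N")     # "the" + "dog" -> NP
print(backward_apply(np, verb))  # NP + "barks" -> S
```

Induction then amounts to learning which categories to assign to words so that derivations like this succeed over raw text.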

Clippers 4/2: Sara Court on Leveraging LLMs for Low-Resource Translation

This work investigates the in-context learning abilities of LLM foundation models when instructed to translate text from a low-resource language into a high-resource language as part of an automated machine translation pipeline. As case studies, I conduct a set of experiments on two language pairs, Inuktitut–English and Quechua–Spanish, and examine the informativity of various types of lexical and grammatical information retrieved from a constrained database of pedagogical materials (dictionaries and grammar lessons), as well as sentence-length examples retrieved from parallel corpora designed for traditional NLP tasks. Ablation studies that manipulate (1) context type (morpheme definitions, grammar lessons, and corpus examples), (2) retrieval method (automated vs. manual), and (3) model type (GPT-4, GPT-3.5 Turbo, Llama 2, and Gemini) suggest that even relatively small (7B) LLMs can exploit prompt context for zero-shot translation when provided with a minimally sufficient amount of relevant linguistic information. However, the variable effects of database construction, retrieval method, model type, and linguistic structure highlight the limitations of even the best LLMs as standalone translation systems for the majority of the world’s 7,000+ languages and their speakers.
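The pipeline described above assembles retrieved dictionary entries, grammar material, and parallel examples into the model’s prompt. A rough sketch of that assembly step, under my own assumptions about how the retrieved pieces are formatted (the function name, prompt wording, and example strings below are illustrative, not from the paper):

```python
def build_translation_prompt(source, src_lang, tgt_lang,
                             glosses=(), examples=()):
    """Assemble a zero-shot translation prompt from retrieved context.

    glosses: (morpheme, definition) pairs from a dictionary database.
    examples: (source, target) sentence pairs from a parallel corpus.
    """
    parts = [f"Translate the following {src_lang} sentence into {tgt_lang}."]
    if glosses:
        parts.append("Relevant dictionary entries:")
        parts += [f"  {m}: {d}" for m, d in glosses]
    if examples:
        parts.append("Example translations:")
        parts += [f"  {s} => {t}" for s, t in examples]
    parts += [f"{src_lang}: {source}", f"{tgt_lang}:"]
    return "\n".join(parts)

print(build_translation_prompt("Qanuippit?", "Inuktitut", "English",
                               glosses=[("qanuq", "how")]))
```

Under this framing, the ablations over context type reduce to toggling which of `glosses` and `examples` are passed in, while retrieval method and model type vary upstream and downstream of this step.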