Clippers 9/20: Byung-Doh Oh on the larger-gets-worse behavior of GPT-Neo/OPT surprisal

Why Does Surprisal From Larger Transformer-Based Language Models Provide a Poorer Fit to Human Reading Times?

Byung-Doh Oh and William Schuler

This work presents a replication and post-hoc analysis of recent surprising findings that larger GPT-2 language model variants with lower perplexity nonetheless yield surprisal estimates that are less predictive of human reading times (Oh et al., 2022). First, regression analyses show a strictly monotonic, positive log-linear relationship between perplexity and fit to reading times for five GPT-Neo variants and eight OPT variants on two separate datasets, providing strong empirical support for this trend. Subsequently, an analysis of residual errors reveals systematic deviations by the larger variants, such as underpredicting the reading times of named entities and overpredicting the reading times of nouns that are heavily constrained by the discourse. These results suggest that the propensity of larger Transformer-based models to ‘memorize’ sequences during training makes their surprisal estimates diverge from humanlike expectations, which warrants caution when using pretrained language models to study human language processing.
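
As a concrete illustration of the quantity at issue, the sketch below shows one way to compute per-token surprisal from a small GPT-Neo variant using the Hugging Face transformers library. It is not the authors' pipeline; the checkpoint name and example sentence are illustrative assumptions. Studies like this one then regress human reading times on these surprisal values and compare the resulting fit across model variants.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Illustrative checkpoint; any GPT-Neo or OPT variant could be substituted.
model_name = "EleutherAI/gpt-neo-125M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

sentence = "The old man the boats."  # example input; not from the paper's corpora
ids = tokenizer(sentence, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits  # shape: (1, sequence_length, vocabulary_size)

# Surprisal of token w_t given its context is -log2 P(w_t | w_1..w_{t-1});
# the logits at position t-1 give the distribution over the token at position t.
log_probs = torch.log_softmax(logits, dim=-1)
surprisals = -log_probs[0, :-1, :].gather(
    1, ids[0, 1:].unsqueeze(-1)
).squeeze(-1) / torch.log(torch.tensor(2.0))  # convert nats to bits

for token, s in zip(tokenizer.convert_ids_to_tokens(ids[0, 1:].tolist()), surprisals):
    print(f"{token:>12s}  {s.item():6.2f} bits")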