Analyzing the predictive power of neural LM surprisal
This work presents an in-depth analysis of an observation that contradicts the findings of recent work in computational psycholinguistics, namely that smaller GPT-2 models that show higher test perplexity nonetheless generate surprisal estimates that are more predictive of human reading times. Analysis of the surprisal values shows that rare proper nouns, which are typically tokenized into multiple subword tokens, are systematically assigned lower surprisal values by the larger GPT-2 models. A comparison of residual errors from regression models fit to reading times reveals that regression models with surprisal predictors from smaller GPT-2 models have significantly lower mean absolute errors on words that are tokenized into multiple tokens, while this trend is not observed on words that are kept intact. These results indicate that the ability of larger GPT-2 models to predict internal pieces of rare words more accurately makes their surprisal estimates deviate from humanlike expectations that manifest in self-paced reading times and eye-gaze durations.