There has been considerable interest in predicting reading times and brain imaging data using predictors from large language models (LLMs), with some conjecturing a positive ‘quality-power’ effect whereby lower language model (LM) perplexity yields better psychometric prediction, favoring larger models. Recent experiments using these models’ negative log word probability as a predictor have cast doubt on this effect (Oh et al., 2022; Oh and Schuler, 2023), instead finding an inverse relationship that favors smaller models, but other experiments predicting psychometric data directly from LM vectors (Schrimpf et al., 2021) have shown improved fit to reading times as model perplexity decreases, again favoring larger models. However, these studies using model vectors introduce a potential confound: they simultaneously vary the number of predictors, which increases the number of degrees of freedom of the regression model. The experiments described in this talk therefore evaluate the number of predictors as a possible confound to the quality-power effect.
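As a concrete illustration of the confound (not the talk’s actual evaluation), the Python sketch below fits simulated reading times once with a single surprisal predictor and once with a high-dimensional matrix of LM hidden states; the variable names and data are hypothetical, but the extra predictors inflate in-sample fit purely through added degrees of freedom.

```python
# Illustrative sketch of the degrees-of-freedom confound (hypothetical data;
# not the evaluation used in the talk). Variable names are placeholders.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_words = 2000
rt = rng.normal(300, 50, n_words)              # simulated per-word reading times (ms)
surprisal = rng.normal(8, 2, n_words)          # one predictor: -log P(word | context)
lm_vectors = rng.normal(size=(n_words, 512))   # many predictors: LM hidden states

def fit_and_score(X, y):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    model = LinearRegression().fit(X_tr, y_tr)
    return model.score(X_tr, y_tr), model.score(X_te, y_te)

in_1, out_1 = fit_and_score(surprisal.reshape(-1, 1), rt)
in_k, out_k = fit_and_score(lm_vectors, rt)
# Even with pure noise, the 512-predictor regression shows a much higher
# in-sample R^2 than the single-predictor one -- the degrees-of-freedom confound.
print(f"surprisal:  train R^2={in_1:.3f}  test R^2={out_1:.3f}")
print(f"LM vectors: train R^2={in_k:.3f}  test R^2={out_k:.3f}")
```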
Clippers 2/18: Ash Lewis on Knowledge Distillation vs. Self-Training for Reducing Hallucination in QA Agents
Winning Big with Small Models: Knowledge Distillation vs. Self-Training for Reducing Hallucination in QA Agents
The deployment of Large Language Models (LLMs) in customer support is constrained by hallucination—generating false information—and the high cost of proprietary models. To address these challenges, we propose a retrieval-augmented question-answering (QA) pipeline and explore how to balance human input and automation. Using a dataset of questions about a Samsung Smart TV user manual, we demonstrate that synthetic data generated by LLMs outperforms crowdsourced data in reducing hallucination in finetuned models. We also compare self-training (fine-tuning models on their own outputs) and knowledge distillation (fine-tuning on stronger models’ outputs, e.g., GPT-4o), and find that self-training achieves comparable hallucination reduction. We conjecture that this surprising finding can be attributed to increased exposure bias issues in the knowledge distillation case and support this conjecture with post hoc analysis. We also improve robustness to unanswerable questions and retrieval failures with contextualized “I don’t know” responses. These findings show that scalable, cost-efficient QA systems can be built using synthetic data and self-training with open-source models, reducing reliance on proprietary tools or costly human annotations.
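As a rough sketch of how the two fine-tuning regimes differ, the Python below builds training examples either from the student model’s own outputs (self-training) or from a stronger teacher such as GPT-4o (knowledge distillation), and falls back to a contextualized “I don’t know” response when retrieval fails. The function and argument names are illustrative placeholders, not the actual pipeline’s API.

```python
# Hypothetical sketch of the two fine-tuning data regimes; `retrieve` and
# `generate` are caller-supplied stand-ins, not the pipeline's real API.
IDK_RESPONSE = "I'm sorry, I couldn't find that in the TV manual."

def build_finetune_data(questions, retrieve, generate, idk=IDK_RESPONSE):
    """Build (question, context, answer) triples for fine-tuning.

    Self-training: pass the student model's own generation function.
    Knowledge distillation: pass a stronger teacher's (e.g., GPT-4o) function.
    """
    examples = []
    for q in questions:
        passages = retrieve(q)   # retrieval over the user-manual index
        if not passages:         # retrieval failure -> contextualized refusal
            examples.append({"question": q, "context": "", "answer": idk})
            continue
        answer = generate(q, passages)  # student output (self-training) or teacher output (distillation)
        examples.append({"question": q, "context": passages, "answer": answer})
    return examples
```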
Clippers 2/4: Sam Stevens on DSPy, compiling prompts, and similar work
As language models continue to evolve, the complexity of prompt engineering has grown in parallel. My talk examines the fundamental insights of DSPy through the lens of plib, a minimalist implementation that highlights key principles often overlooked in current LLM research. I argue that automated few-shot example selection can match or exceed carefully crafted zero-shot prompts, challenging the conventional wisdom of prompt engineering. The framework introduces a novel perspective on compute scaling in language models, suggesting “prompt compilation” as a fourth axis alongside pre-training, post-training, and inference-time computation. By treating prompt optimization as a reinforcement learning problem with verifiable rewards, plib offers a systematic approach to example selection. I argue that this style of thinking enables the decomposition of complex language tasks into modular sub-programs, a capability that proves challenging with traditional prompting methods. I will illustrate how many contemporary developments in LLM applications are natural extensions of principles already present in DSPy’s design, arguing for a renewed examination of these foundational ideas in the context of modern language model development.
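To make the “prompt compilation” idea concrete, here is a minimal sketch that selects few-shot examples by searching against a verifiable exact-match reward on a small dev set. The `call_lm` client and the random search are assumptions for illustration, not plib’s or DSPy’s actual API or optimizer.

```python
# Minimal illustration of prompt compilation: choose few-shot examples by
# scoring candidate prompts with a verifiable reward on held-out items.
# `call_lm` is a caller-supplied placeholder, and random search stands in
# for the smarter optimizers that plib/DSPy actually use.
import random

def compile_prompt(instruction, example_pool, dev_set, call_lm, k=3, n_trials=20, seed=0):
    rng = random.Random(seed)

    def build(shots, question):
        demos = "\n\n".join(f"Q: {ex['q']}\nA: {ex['a']}" for ex in shots)
        return f"{instruction}\n\n{demos}\n\nQ: {question}\nA:"

    best_shots, best_score = [], -1.0
    for _ in range(n_trials):
        candidate = rng.sample(example_pool, k)
        # Verifiable reward: exact-match accuracy of the compiled prompt on the dev set.
        score = sum(
            call_lm(build(candidate, item["q"])).strip() == item["a"]
            for item in dev_set
        ) / len(dev_set)
        if score > best_score:
            best_shots, best_score = candidate, score
    # Return a prompt template with the best demonstrations baked in.
    return build(best_shots, "{question}"), best_score
```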