At Clippers Tuesday, Manirupa will present “A Phrasal Embedding–based General Language Model for Query Expansion in Information Retrieval”:
Traditional knowledge graphs driven by knowledge bases can represent facts about and capture relationships among entities very well, and thus perform quite accurately in factual information retrieval. However, in addressing the complex information needs of subjective queries requiring adaptive decision support, these systems can fall short, as they are unable to fully capture novel associations among potentially key concepts. In this work, we explore a novel use of language model–based document ranking to develop a fully unsupervised method for query expansion that associates documents with novel related concepts extracted from the text. To achieve this, we extend the word embedding-based generalized language model of Ganguly et al. (2015) to employ phrasal embeddings, and evaluate its performance on an IR task using the TREC 2016 clinical decision support challenge dataset. Our model, used for query expansion both directly and via a feedback loop, shows statistically significant improvement not only over various baselines that utilize standard MeSH terms and UMLS concepts for query expansion (Rivas et al., 2014), but also over our word embedding-based language model baseline, built on top of a standard Okapi BM25-based document retrieval system.
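As a rough, self-contained sketch of the core idea (not the actual model from the talk), embedding-based query expansion amounts to pulling in the nearest neighbors of a query phrase in embedding space. The phrases and vector values below are invented for illustration:

```python
from math import sqrt

# Toy phrase embeddings -- values are hypothetical, for illustration only.
embeddings = {
    "heart attack":          [0.9, 0.1, 0.0],
    "myocardial infarction": [0.85, 0.15, 0.05],
    "chest pain":            [0.7, 0.3, 0.1],
    "broken arm":            [0.0, 0.2, 0.9],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def expand_query(phrase, k=2, threshold=0.8):
    """Return up to k phrases whose embeddings are most similar to the query."""
    q = embeddings[phrase]
    scored = sorted(((p, cosine(q, v)) for p, v in embeddings.items() if p != phrase),
                    key=lambda pv: pv[1], reverse=True)
    return [p for p, s in scored[:k] if s >= threshold]

print(expand_query("heart attack"))  # -> ['myocardial infarction', 'chest pain']
```

In the full model, expansion terms like these would feed back into the BM25-based retrieval stage, either directly or via a feedback loop.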
This past Friday, we were pleased to host Dan Garrette from Google, who gave a talk in the NLP/AI series.
Title: Learning from Weak Supervision: Combinatory Categorial Grammars and Historical Document Transcription
As we move NLP toward domains and languages where supervised training resources are not available, there is an increased need to learn models from less annotation. In this talk, I will describe two projects on learning from weak supervision. First, I will discuss work on learning combinatory categorial grammars (CCGs) from incomplete information. In particular, I will show how universal, intrinsic properties of the CCG formalism can be encoded as priors and used to guide the learning of supertaggers and parsers. These universal priors can, in turn, be combined with corpus-specific knowledge derived from limited amounts of available annotation to further improve performance. Second, I will present work on learning to automatically transcribe historical documents that feature heavy use of code-switching and non-standard orthographies that include obsolete spellings, inconsistent diacritic use, typos, and archaic shorthands. Our state-of-the-art model is able to induce language-specific probabilistic mappings from language model data with standard orthography to the document-specific orthography on the page by jointly modeling both variant-preserving and normalized transcriptions. I will conclude with a discussion of how our work has opened up new avenues of research for scholars in the digital humanities, with a focus on transcribing books printed in Mexico in the 1500s.
Dan is a research scientist at Google in NYC. He was previously a postdoctoral researcher at the University of Washington working with Luke Zettlemoyer, and obtained his PhD at the University of Texas at Austin under the direction of Jason Baldridge and Ray Mooney.
This Tuesday, Joo-Kyung Kim will be talking about his current work on cross-lingual transfer learning for POS tagging:
POS tagging is a relatively easy task given sufficient training examples, but since each language has its own vocabulary space, parallel corpora are usually required to utilize POS datasets in different languages for transfer learning. In this talk, I introduce a cross-lingual transfer learning model for POS tagging, which utilizes language-general and language-specific representations with auxiliary objectives such as language-adversarial training and language modeling. Evaluating on POS datasets from Universal Dependencies 1.4, I show preliminary results indicating that the proposed model can be used effectively for cross-lingual transfer learning without any parallel corpora or gazetteers.
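Language-adversarial training is commonly implemented with a gradient reversal layer: a language discriminator tries to identify the input language, while the reversed gradient pushes the shared encoder toward language-general features. A minimal sketch of that layer (not the talk's actual model):

```python
class GradientReversal:
    """Identity in the forward pass; negates (and scales) the gradient in
    the backward pass. The discriminator above this layer learns to
    identify the language, while the reversed gradient trains the shared
    encoder below it to discard language-specific cues."""

    def __init__(self, lam=1.0):
        self.lam = lam  # strength of the adversarial signal

    def forward(self, x):
        return x  # no change to activations

    def backward(self, grad):
        return [-self.lam * g for g in grad]  # flip and scale the gradient

grl = GradientReversal(lam=0.5)
print(grl.backward([0.1, 0.2, -0.3]))  # gradient flipped and scaled
```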
This Tuesday, Kasia Hitczenko will be visiting from the University of Maryland:
Using prosody to learn sound categories
Infants must learn the sound categories of their language, but this is difficult because there is variability in speech that causes overlap between categories and masks where the correct categories are. This work investigates whether incorporating knowledge of these systematic sources of variability can improve sound category learning. I present two models that incorporate one such source of variability, namely prosody, into two existing models of sound category learning and present preliminary results on simulated data from one of these models.
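To make the distributional-learning setting concrete, here is a minimal two-category learner (standard mixture-of-Gaussians EM, not the models from the talk) run on simulated one-dimensional acoustic values from two overlapping categories:

```python
import random
from math import exp, sqrt, pi

random.seed(0)
# Simulated 1D acoustic values (e.g. a single cue) from two overlapping
# sound categories centered at 0.0 and 2.5.
data = ([random.gauss(0.0, 1.0) for _ in range(200)] +
        [random.gauss(2.5, 1.0) for _ in range(200)])

def pdf(x, m, s):
    return exp(-(x - m) ** 2 / (2 * s * s)) / (s * sqrt(2 * pi))

# Two-category Gaussian mixture, fit with EM.
mu, sigma, w = [-1.0, 1.0], [1.0, 1.0], [0.5, 0.5]
for _ in range(50):
    # E-step: how responsible is each category for each data point?
    resp = []
    for x in data:
        d = [w[k] * pdf(x, mu[k], sigma[k]) for k in range(2)]
        z = sum(d)
        resp.append([dk / z for dk in d])
    # M-step: re-estimate category parameters from the soft assignments.
    for k in range(2):
        n = sum(r[k] for r in resp)
        mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / n
        sigma[k] = sqrt(sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / n)
        w[k] = n / len(data)

print([round(m, 2) for m in sorted(mu)])  # recovered category means
```

The point of the talk's models is that systematic conditioning factors like prosody can explain away some of the overlap that makes this purely distributional learning hard.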
This Tuesday, David King will be talking about his ongoing work on morphological reinflection:
In a recent shared task, neural machine translation systems performed well at reinflecting a variety of languages (e.g. German, Hungarian, and Turkish), but not Russian. I will present preliminary attempts to analyze where the top-performing neural machine translation model still fails with Russian. Since these shortcomings are primarily related to a word’s semantics and sound change (i.e. phonological alternation), I hope to overcome these challenges using Russian word vectors and an additional character-level language model.
This Tuesday, Adam Stiff will be talking about his efforts to take a dynamical systems-based approach to speech recognition (yes, via spiking networks):
Speech can be viewed as a dynamical system (i.e. a continuous function from a state space onto itself, with state changing continuously through time), and in very broad terms, this perspective should be fairly uncontroversial (indeed, it is often the basis for models of speech production). It is, however, extremely impractical, due to the huge number of nonlinear variables involved, and the apparent lack of a framework for learning them. Thus, the tools developed by mathematicians to understand nonlinear dynamical systems have not been widely utilized in attempts at automated speech recognition. I’ll argue that the brain does employ such techniques, and that adapting them could produce benefits in terms of energy efficiency, scalability, and robustness to the problem of catastrophic forgetting in the face of ongoing learning. Furthermore, observation of “fast” (sub-millisecond) dynamics may theoretically offer some benefits for recognition accuracy, and act as a bottom-up factor in learning phone segmentation. I also hope to exhibit some results from an (ongoing) phone classification experiment, to identify constraints that should be respected by a successful implementation of some of these ideas.
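For a concrete feel for the kind of spiking dynamics involved, here is a minimal leaky integrate-and-fire neuron simulation (all parameters are illustrative, not taken from the talk):

```python
# Leaky integrate-and-fire neuron: membrane potential leaks toward rest,
# integrates a constant input, and emits a spike on threshold crossing.
dt = 0.0001                          # 0.1 ms time step
tau = 0.010                          # 10 ms membrane time constant
v_rest, v_thresh, v_reset = 0.0, 1.0, 0.0
drive = 1.2                          # constant input; steady state sits above threshold

v, spike_times = v_rest, []
for step in range(int(round(0.1 / dt))):     # simulate 100 ms
    v += dt / tau * (-(v - v_rest) + drive)  # leaky integration (Euler step)
    if v >= v_thresh:                        # threshold crossing -> spike
        spike_times.append(step * dt)
        v = v_reset                          # reset after the spike

print(len(spike_times), "spikes in 100 ms")
```

Even this toy neuron is a nonlinear dynamical system (because of the reset), which is the level of description the talk argues has been underused in speech recognition.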
At Clippers Tuesday, I’ll motivate a new approach to scope taking in combinatory categorial grammar and discuss progress and plans for implementing the approach (in collaboration with Jordan Needle, Carl Pollard, Simon Charlow and Dylan Bumford):
A long-standing puzzle in natural language semantics has been how to explain the exceptional scope behavior of indefinites. Charlow (2014) has recently shown that their exceptional scope behavior can be derived from a dynamic semantics treatment of indefinites, i.e. one where the function of indefinites is to introduce discourse referents into the evolving discourse context. To do so, he showed that (1) a monadic approach to dynamic semantics can be seamlessly integrated with Barker and Shan’s (2015) approach to scope taking in continuized grammars, and (2) once one does so, the exceptional scope of indefinites follows from the way the side effect of introducing a discourse referent survives the process of delimiting the scope of true quantifiers such as those expressed with ‘each’ and ‘every’.
To date, computationally implemented approaches to scope taking have not distinguished indefinites from true quantifiers in a way that accounts for their exceptional scope taking. Although Steedman (2011) has developed an account of indefinites’ exceptional scope taking by treating them as underspecified Skolem terms in a non-standard static semantics for Combinatory Categorial Grammar (CCG), this treatment has not been implemented in its full complexity. Moreover, as Barker and Shan point out, Steedman’s theory appears to undergenerate by not allowing true quantifiers to take scope from medial positions.
Barker and Shan offer a brief sketch of how their approach might be implemented, including how lifting can be invoked lazily to ensure parsing terminates. In this talk, I will show how their approach can be seamlessly combined with Steedman’s CCG and extended to include the first prototype implementation of Charlow’s semantics of indefinites, thereby yielding an approach that improves upon scope taking in CCG while retaining many of its attractive computational properties.
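The monadic idea can be miniaturized: meanings are state transformers over a discourse context, and an indefinite's side effect is to push a fresh referent onto that context. This toy sketch (emphatically not the prototype implementation discussed above) shows the shape of it:

```python
# A meaning is a function from a discourse context (a list of referents)
# to a (value, updated context) pair; bind threads the context through
# composition -- i.e., the State monad.

def unit(x):
    return lambda ctx: (x, ctx)

def bind(m, f):
    def run(ctx):
        x, ctx2 = m(ctx)
        return f(x)(ctx2)
    return run

def indefinite(noun):
    """'a <noun>': introduce a fresh discourse referent as a side effect."""
    def run(ctx):
        ref = f"{noun}_{len(ctx)}"
        return ref, ctx + [ref]
    return run

# "A farmer owns a donkey": both indefinites add referents to the context,
# and those referents survive in the output context for later anaphora.
sentence = bind(indefinite("farmer"),
                lambda f: bind(indefinite("donkey"),
                               lambda d: unit(("owns", f, d))))
value, context = sentence([])
print(value, context)
```

In Charlow's full system, the interesting part is that these referent-introducing side effects survive the delimiting of true quantifiers' scope, which is what yields indefinites' exceptional scope behavior.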
This Tuesday, Micha Elsner will be presenting preliminary work on neural network word segmentation:
Given a corpus of phonemically transcribed utterances with unknown word boundaries, how can a cognitive model extract the vocabulary? I propose a new model based on working memory: the model must balance phonological memory (remembering how to pronounce words) with syntactic memory (remembering the utterance it just heard). Simulating the memory with encoder-decoder RNNs, I use reinforcement learning to optimize the segmentations.
Why build yet another model of word segmentation? (Is this simply a buzzword-compatibility issue? A little bit, but…) I hope to show that this model provides a deeper cognitive account of the prior biases used in previous work, and that its noisy, error-prone reconstruction process makes it inherently robust to variation in its input.
This is work in progress, so don’t expect great things from me yet. However, I will demonstrate model performance slightly worse than Goldwater et al. (2009) on a standard dataset and discuss some directions for future work. Criticism, suggestions and thrown paper airplanes welcome.
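To see the memory trade-off in miniature, here is a toy objective (not the RNN model) that charges the phonological memory for each distinct word type stored and the syntactic memory for each word token in the utterance, then searches segmentations exhaustively:

```python
from itertools import product

def segmentations(s):
    """Enumerate every way to split s into contiguous words."""
    for cuts in product([False, True], repeat=len(s) - 1):
        words, start = [], 0
        for i, cut in enumerate(cuts, 1):
            if cut:
                words.append(s[start:i])
                start = i
        words.append(s[start:])
        yield words

def cost(words):
    """Toy memory cost: phonological memory pays per character of each
    distinct word type; syntactic memory pays per word token. The 1.5
    weight is an arbitrary illustrative trade-off."""
    return sum(len(w) for w in set(words)) + 1.5 * len(words)

utterance = "youwantyouwant"
best = min(segmentations(utterance), key=cost)
print(best)  # the reused chunk "youwant" wins
```

The encoder-decoder model replaces these hand-set costs with learned reconstruction losses, and reinforcement learning replaces the exhaustive search.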
This Tuesday, Denis Newman-Griffis will be presenting on learning embeddings for ontology concepts:
Recent work on embedding ontology concepts has relied on either expensive manual annotation or automated concept tagging methods that ignore the textual contexts around concepts. We propose a novel method for jointly learning concept, phrase, and word embeddings from an unlabeled text corpus, by using the representative phrases for ontology concepts as distant supervision. We learn embeddings for medical concepts in the Unified Medical Language System and general-domain concepts in YAGO, using a variety of corpora. Our embeddings show performance competitive with existing methods on concept similarity and relatedness tasks, while requiring no human corpus annotation and achieving more than 3x greater vocabulary coverage.
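The distant-supervision step can be pictured as rewriting the corpus so that representative phrases become concept tokens before standard embedding training; the lexicon and concept IDs below are made up for illustration:

```python
# Hypothetical concept lexicon: representative phrases -> concept IDs
# (the IDs here are invented for this example).
lexicon = {
    "heart attack": "C0027051",
    "high blood pressure": "C0020538",
}

def tag_concepts(tokens, lexicon, max_len=4):
    """Greedy longest-match replacement of representative phrases with
    concept IDs, so a standard embedding model then sees concepts,
    phrases, and words in shared contexts."""
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if n > 1 and phrase in lexicon:
                out.append(lexicon[phrase])  # emit the concept token
                i += n
                break
        else:
            out.append(tokens[i])            # no match: keep the word
            i += 1
    return out

print(tag_concepts("patient denies heart attack history".split(), lexicon))
```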
I’ll also be talking a bit about trying to build an analogy completion dataset for the biomedical domain.
This past Tuesday, 2/7, Evan Jaffe presented on his progress on the Virtual Patient project:
I’ll be discussing results on a baseline log-linear model and the improvement gained from using a simple embedding-similarity feature. I’ll also discuss motivation and related work, and the current status of implementing a simple CNN with padding and max pooling to do multiclass classification for this dataset.
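As a sketch of what a log-linear classifier with an embedding-similarity feature looks like (the candidate questions, feature values, and weights here are all invented, not from the project):

```python
from math import exp, sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# Hypothetical weights for the features [bias, word overlap, embedding similarity].
w = [0.1, 1.0, 2.0]
query_vec = [1.0, 0.0]  # toy embedding of the input utterance
candidates = [
    # (candidate class, toy embedding, word-overlap feature)
    ("Do you have chest pain?", [0.9, 0.1], 0.5),
    ("Any allergies?",          [0.0, 1.0], 0.2),
]

# Score each class, then normalize with a softmax.
scores = [sum(wi * fi for wi, fi in
              zip(w, [1.0, overlap, cosine(query_vec, vec)]))
          for _, vec, overlap in candidates]
exps = [exp(s) for s in scores]
probs = [e / sum(exps) for e in exps]
best = candidates[probs.index(max(probs))][0]
print(best)
```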