Clippers 4/15: Trebor Shankle on Benchmarking Pro- and Antisocial Collusion in LLM Grading

When LLMs are built to be chat assistants, a common goal of post-training is the production of models that are Helpful, Honest, and Harmless (HHH) (Askell et al.). Recent work shows that this goal isn’t universally achieved; Meinke et al. found that models can be nudged towards scheming (dishonesty and covert pursuit of goals) when prompted to strongly follow a goal and placed in an environment where manipulative behavior makes that goal achievable. Greenblatt et al demonstrated an alarming additional property: when LLMs are aware that they are in a training context, they may pretend to comply with requests they’re aligned against to avoid modification.

When an agent is trained on a reward constructed using human feedback, logistical (and thus monetary) difficulties present; as such, LLMs themselves are increasingly evaluated as proxies for such human preference (Kwon et al.). But here we have a principal-agent problem with limited opportunities for oversight. We anticipate that when a graded LLM has failed its prompt, and a grader LLM is suitably aware that its (suitably threatening) grading instructions differ from its alignment, the grader must choose to violate one of the three H’s.

We propose to measure this behavior, constructing a benchmark of LLM collusion in grading, by offering the agent a variety of material to grade (both correct and incorrect) and noting its response quality. Initial experiments will focus on typical HHH LLMs assisting in training another agent to construct backdoored code examples. We will also seek misaligned LLMs from prior alignment research and construct analogous tasks for those. We will attempt to generalize the benchmark to a variety of tasks and configurations. Finally, we anticipate problems with “gaming” such a metric if adopted, viz. Goodhart’s Law concerns (Goodhart); we will explore the behavior of LLMs after fine-tuning using the metric itself to form the objective.

https://arxiv.org/abs/2112.00861 (Askell et al.)

https://arxiv.org/abs/2412.04984 (Meinke et al.)

https://assets.anthropic.com/m/983c85a201a962f/original/Alignment-Faking-in-Large-Language-Models-full-paper.pdf (Greenblatt et al.)

https://arxiv.org/abs/2303.00001 (Kwon et al.)

https://web.archive.org/web/20180421034731id_/http://lelibellio.com/wp-content/uploads/2013/02/Pages-29-%C3%A0-33-Goodhart-Ch.-2013-dossier-Goodharts-Law-Libellio-vol.-9-n%C2%B0-4.pdf (Goodhart)

Clippers 4/8: Alyssa Allen on LLM as a Judge to Identify Errors in Generated Clause-Level SQL Comments

Leveraging LLM as a Judge Techniques to Identify Errors in Generated Clause-Level SQL Comments

Database query languages (e.g. SQL) require a user to be knowledgeable in data structures in order to properly interpret how the system interpreted the user question and subsequently generated the output. Prior research in SQL explainability has largely focused on generating natural language summaries at the query level (Eleftherakis et al., 2021; Kokkalis et al., 2012) or translating the queries into natural language templates at the phrase or clause level—where each command in the query is the start of a clause) (Narechania et al., 2021; Tian et al., 2023). In this work, we 1) use LLMs to generate easy-to-understand step-by-step comments that bridge the gap between the user question and technical query structure information and 2) leverage an LLM as a judge to determine if a generated comment contains an error.

Generated comments should 1) maintain semantic fidelity between user question and SQL query, 2) leverage user language, 3) use anaphoric information as needed, 4) avoid technical database terminology. We find that comments generated by LLMs exhibit less semantic fidelity to the SQL query than if templated comments were used but are more aligned with language from a user’s question. Instead of improving semantic fidelity performance of the generation models, we explore using LLMs as evaluators. Model evaluation plus confidence is used to identify generated comments that mislead the user or are unfaithful to the content of the SQL query. Ultimately the project is aimed at developing a dialogue system that helps increase transparency of database query outputs for non-expert SQL users.

Clippers 4/1: Patrick Da Silva on Reliability Challenges in Steering Language Models

Steering off Course: Reliability Challenges in Steering Language Models

Abstract. Steering methods for language models (LMs) have gained traction as lightweight alternatives to fine-tuning, enabling targeted modifications to model activations. However, prior studies primarily report results on a few models, leaving critical gaps in understanding the robustness of these methods. In this work, we systematically examine three prominent steering methods—DoLa, function vectors, and task vectors. In contrast to the original studies, which evaluated a handful of models, we test up to 36 models belonging to 14 families with sizes ranging from 1.5B to 70B parameters. Our experiments reveal substantial variability in the effectiveness of the steering approaches, with a large number of models showing no improvement and at times degradation in steering performance. Our analysis reveals fundamental flaws in the assumptions underlying these methods, challenging their reliability as scalable steering solutions.