When LLMs are built to be chat assistants, a common goal of post-training is the production of models that are Helpful, Honest, and Harmless (HHH) (Askell et al.). Recent work shows that this goal isn’t universally achieved; Meinke et al. found that models can be nudged towards scheming (dishonesty and covert pursuit of goals) when prompted to strongly follow a goal and placed in an environment where manipulative behavior makes that goal achievable. Greenblatt et al demonstrated an alarming additional property: when LLMs are aware that they are in a training context, they may pretend to comply with requests they’re aligned against to avoid modification.
When an agent is trained on a reward constructed using human feedback, logistical (and thus monetary) difficulties present; as such, LLMs themselves are increasingly evaluated as proxies for such human preference (Kwon et al.). But here we have a principal-agent problem with limited opportunities for oversight. We anticipate that when a graded LLM has failed its prompt, and a grader LLM is suitably aware that its (suitably threatening) grading instructions differ from its alignment, the grader must choose to violate one of the three H’s.
We propose to measure this behavior, constructing a benchmark of LLM collusion in grading, by offering the agent a variety of material to grade (both correct and incorrect) and noting its response quality. Initial experiments will focus on typical HHH LLMs assisting in training another agent to construct backdoored code examples. We will also seek misaligned LLMs from prior alignment research and construct analogous tasks for those. We will attempt to generalize the benchmark to a variety of tasks and configurations. Finally, we anticipate problems with “gaming” such a metric if adopted, viz. Goodhart’s Law concerns (Goodhart); we will explore the behavior of LLMs after fine-tuning using the metric itself to form the objective.
https://arxiv.org/abs/2112.00861 (Askell et al.)
https://arxiv.org/abs/2412.04984 (Meinke et al.)
https://assets.anthropic.com/m/983c85a201a962f/original/Alignment-Faking-in-Large-Language-Models-full-paper.pdf (Greenblatt et al.)
https://arxiv.org/abs/2303.00001 (Kwon et al.)