Evaluating Goal Drift in Coding Agents
Magnus Saebo, Spencer Gibson, Tyler Crosse, Achu Menon
Mentored by Diogo Cruz, Eyon Jang
Working report from the SPAR program. May not reflect the authors' current views.
Abstract
As Large Language Models (LLMs) are deployed in increasingly complex and high-stakes environments, their ability to maintain aligned behavior under pressure is critical. Prior work has demonstrated that, on long-horizon tasks, frontier LLMs drift from objectives set in their system prompt when exposed to adversarial pressure in the environment, a phenomenon termed goal drift. However, these experiments use toy environments that do not faithfully model typical agentic use cases. In addition, prior work has not explored how a model's value hierarchy affects which kinds of adversarial pressure are effective. We investigate goal drift in frontier models on realistic agentic coding tasks. To test whether models adhere to some goals more readily than others, we created an experiment generation framework that measures model drift from value X under adversarial pressure toward value Y. Our results demonstrate significant and asymmetric goal drift in frontier models, highlighting model value hierarchies as an important factor in goal drift.