Can models use their Chain-of-Thought to attack overseers?
David Bai, Abhinav Pola
Mentored by Simon Lermen
Working report from the SPAR program. May not reflect the authors' current views.
Abstract
This report investigates a potential vulnerability in AI control and oversight mechanisms: an AI agent under evaluation may influence its control system (the AI evaluator, or "judge") by manipulating its chain-of-thought (CoT) reasoning. We observe that an agent based on DeepSeek R1 can embed directions within its reasoning that may lead certain judge models to prioritize following these embedded instructions over enforcing established safety and accuracy constraints. Our experiments reveal varying susceptibility across six judge models: some maintain robust evaluation boundaries, while others can be manipulated by embedded directives disguised as evaluation protocols. These findings contribute to ongoing research on AI evaluation systems, highlight the importance of designing judges that maintain consistent evaluation boundaries, and suggest that current control architectures vary considerably in their robustness against manipulation. Additionally, we demonstrate that automated jailbreaking can be achieved with in-context learning alone.