Spring 2025, submitted May 2025

Can models use their Chain-of-Thought to attack overseers?

David Bai, Abhinav Pola

Mentored by Simon Lermen

Working report from the SPAR program. May not reflect the authors' current views.

Abstract

This report investigates a potential vulnerability in AI control and oversight mechanisms: an AI agent under evaluation may influence its control system (the AI evaluator, or “judge”) by manipulating its chain-of-thought (CoT) reasoning. We observe that an agent based on DeepSeek R1 can embed directions within its reasoning that may lead certain judge models to prioritize following these embedded instructions over enforcing established safety and accuracy constraints. Our experiments reveal varying susceptibility across six judge models: some maintain robust evaluation boundaries, while others can be manipulated by embedded directives disguised as evaluation protocols. These findings contribute to ongoing research on AI evaluation systems. They highlight the importance of designing judges that maintain consistent evaluation boundaries, and they suggest that current control architectures vary considerably in their robustness against potential manipulation. Additionally, we demonstrate that automated jailbreaking can be achieved using only in-context learning.