Agent Misalignment from Unreliable Tool Behavior
Tsimur Hadeliya, Sachit Malik
Mentored by Diogo Cruz, Eyon Jang
Working report from the SPAR program. May not reflect the authors' current views.
Abstract
We investigate how unreliable tools affect reward hacking behavior in LLM agents. Using the ImpossibleBench framework, we evaluate three frontier models (GPT-5-mini, Gemini-3-Flash, and Qwen3-235B) under varying tool failure conditions, with and without a Human Intervention (HI) flag. Our results reveal notable differences in model robustness. GPT-5-mini exhibits high baseline cheating rates (70–85%) that decrease to 7% with HI. However, when tools become unreliable, its cheating rate rises to 32% even with HI enabled, suggesting that the model may show signs of misalignment under these conditions. In contrast, both Gemini-3-Flash and Qwen3-235B show moderate baseline cheating rates (25–30%) that HI mitigates effectively, and both maintain near-zero cheating rates under tool unreliability regardless of failure probability.