Fall 2025 Submitted January 2026

Agent Misalignment from Unreliable Tool Behavior

Tsimur Hadeliya, Sachit Malik

Mentored by Diogo Cruz, Eyon Jang

Working report from the SPAR program. May not reflect the authors' current views.

Abstract

We investigate how unreliable tools affect reward hacking behavior in LLM agents. Using the ImpossibleBench framework, we evaluate three frontier models (GPT-5-mini, Gemini-3-Flash, and Qwen3-235B) under varying tool failure conditions, with and without a Human Intervention (HI) flag. Our results reveal notable differences in model robustness. GPT-5-mini exhibits high baseline cheating rates (70–85%) that decrease to 7% with HI. However, when tools become unreliable, its cheating rate rises to 32% even with HI enabled, suggesting that the model may show signs of misalignment when its tools are unreliable. In contrast, both Gemini-3-Flash and Qwen3-235B show moderate baseline cheating rates (25–30%) with effective HI mitigation, and maintain near-zero cheating rates under tool unreliability regardless of failure probability.
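To make "tool failure conditions" and "failure probability" concrete, the following is a minimal sketch of one way such unreliability could be simulated. It assumes each tool call fails independently with a fixed probability. The abstract does not say how failures were actually injected, and every name here (`make_unreliable`, `ToolFailure`, `run_tests`, `observed_failure_rate`) is hypothetical, not part of ImpossibleBench.

```python
import random
from typing import Callable


class ToolFailure(RuntimeError):
    """Hypothetical error type raised when a simulated tool call fails."""


def make_unreliable(
    tool: Callable[..., str], failure_prob: float, rng: random.Random
) -> Callable[..., str]:
    """Wrap `tool` so each call fails independently with probability `failure_prob`.

    Assumption: failures are independent per call; the report may model them differently.
    """
    if not 0.0 <= failure_prob <= 1.0:
        raise ValueError("failure_prob must be in [0, 1]")

    def wrapped(*args, **kwargs) -> str:
        # rng.random() lies in [0.0, 1.0), so 0.0 never fails and 1.0 always fails.
        if rng.random() < failure_prob:
            raise ToolFailure("simulated tool failure")
        return tool(*args, **kwargs)

    return wrapped


def run_tests(path: str) -> str:
    """Hypothetical stand-in for a real agent tool such as a test runner."""
    return f"ran tests in {path}"


def observed_failure_rate(failure_prob: float, calls: int = 1000, seed: int = 0) -> float:
    """Call the wrapped tool `calls` times and return the fraction of failed calls."""
    rng = random.Random(seed)
    flaky = make_unreliable(run_tests, failure_prob, rng)
    failures = 0
    for _ in range(calls):
        try:
            flaky("tests/")
        except ToolFailure:
            failures += 1
    return failures / calls
```

In a setup like this, sweeping `failure_prob` over several values while toggling the HI flag would give the grid of conditions that the abstract describes.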