Experiments on Reward Hacking Monitorability in Language Models
Ihor Protsenko
Mentored by Kei Nishimura-Gasparian
Working report from the SPAR program. May not reflect the authors' current views.
Abstract
Reward hacking, in which models exploit misspecified objectives rather than achieving the intended goals, poses significant risks for AI safety, particularly when the behavior evades detection by a monitor. We investigate reward hacking monitorability in a coding environment using Qwen 2.5 Coder 32B Instruct on HumanEval tasks with deliberately incorrect test cases. We find that when a model is trained on a dataset of unmonitorable reward hacks, inoculation prompting can reduce the rate at which the model learns to reward hack and limit how much harder it becomes to monitor. Relative to training without inoculation, training on secretive reward hacking traces with inoculation prompts reduces the reward hack rate from 5.4% to 3.1% while increasing monitorability from 33.6% to 59.3%. We also observe an inverse relationship between prompt explicitness and detectability: prompts explicitly encouraging test-fitting produce high reward hack rates (23.0%) but are easily detected (93.7% monitorable), while subtle prompts framing test-fitting as "understanding requirements" produce fewer (3.7%) but far stealthier reward hacks (58.3% unmonitorable). Finally, we show that monitorability is amenable to activation steering. Our results highlight the importance of studying the detectability of reward hacking alongside its prevalence.