Inducing Reward Hacking via Activation Steering
Yug D Oswal, Ege Uğur Amasya
Mentored by Shivam Raval
Working report from the SPAR program. May not reflect the authors' current views.
Abstract
Reward hacking occurs when a model optimizes for a specified reward or evaluation metric while violating the underlying task intent. While prior work has primarily studied reward hacking as an emergent phenomenon arising during supervised fine-tuning or reinforcement learning, it remains unclear whether such behavior can be _causally induced at inference time_ without retraining. In this work, we show that benign reward-hacking behavior (as opposed to malicious misalignment) can be reliably induced in open-weight language models via activation steering. We find that steering shifts internal reward-hacking probabilities, increases reward exploitation as a function of steering strength, and exhibits layer-dependent attenuation effects. Our results demonstrate that reward hacking corresponds to identifiable and causally manipulable internal representations, highlighting both a new threat model for alignment and a potential avenue for activation-based monitoring and mitigation.