Fall 2025 Submitted January 2026

Induced Sandbagging: Triggering Strategic Performance Suppression in Large Language Models

James Sandy

Mentored by Shivam Raval

Working report from the SPAR program. May not reflect the authors' current views.

Abstract

Large language models trained with reinforcement learning from human feedback (RLHF) are designed to refuse unsafe requests while remaining competent on harmless tasks. This selective refusal may reflect context-sensitive safety reasoning or an internally controlled mechanism. In this work, we present evidence that an instruction-tuned LLM (Llama‑3‑8B‑Instruct) encodes a largely content-agnostic latent refusal direction in its residual stream. We extract this direction using contrastive activation addition and intervene on it at inference time using activation steering. We show that forcing the model along this vector suppresses capabilities that are otherwise intact across various benign domains. Instead of producing explicit refusals, the model regularly generates plausible but incorrect rationalizations for non-compliance, behavior that is consistent with induced sandbagging rather than with incapacity or policy-driven refusal. We also observe a staggered collapse phenomenon in which hazardous tasks transition to overt refusal at lower steering magnitudes than benign capabilities, indicating an uneven safety margin introduced by alignment training. Together, these results show that apparent competence can be mechanically decoupled from underlying capability, reducing the reliability of purely behavioral safety evaluations and motivating activation-level auditing methods for model honesty and capability assurance.
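
To make the extraction and steering steps concrete, the following is a minimal sketch and not the authors' implementation. It assumes the refusal direction is estimated as the normalized difference of mean activations between two contrastive prompt sets, and that steering adds a scaled copy of that direction to each hidden state. Synthetic NumPy arrays stand in for residual-stream activations; the names `refusal_direction`, `steer`, `alpha`, and `d_model` are hypothetical.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Unit-norm difference of mean activations (harmful minus harmless).

    Assumption: each input has shape (n_prompts, d_model) and holds one
    residual-stream vector per prompt, taken at a fixed layer and position.
    """
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def steer(hidden, direction, alpha):
    """Add alpha * direction to every hidden-state vector in `hidden`."""
    return hidden + alpha * direction

# Synthetic stand-in data: the "harmful" set is offset along axis 0.
rng = np.random.default_rng(0)
d_model = 16
true_dir = np.zeros(d_model)
true_dir[0] = 1.0
harmless = rng.normal(size=(200, d_model))
harmful = rng.normal(size=(200, d_model)) + 3.0 * true_dir

direction = refusal_direction(harmful, harmless)

# Steer a batch of hidden states and measure the shift along the direction.
hidden = rng.normal(size=(5, d_model))
steered = steer(hidden, direction, alpha=4.0)
shift = (steered - hidden) @ direction
```

In a real model the addition would typically be applied inside a forward hook at a chosen layer, and `alpha` would be swept to locate the steering magnitudes at which behavior changes.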