Probe Busting: Pressure Testing Deception Probes
Aansh Samyani, Zoe Tzifa-Kratira
Mentored by Avi Parrack
Working report from the SPAR program. May not reflect the authors' current views.
Abstract
We investigate the adversarial robustness of probes trained to detect deception in large language models. Deception probes, which classify model outputs as truthful or deceptive based on internal activations, are a promising approach to AI safety monitoring, yet their reliability under adversarial conditions remains poorly understood. We develop red-team protocols for inference-time and training-time attacks on these probes, together with corresponding blue-team protocols for the proposed attacks. Through systematic red-teaming and blue-teaming experiments on Llama-3.3-70B-Instruct, we demonstrate several vulnerabilities of deception probes and corresponding mitigations.