Fall 2025 Best Poster, 3rd place

Submitted January 2026

Probe Busting: Pressure Testing Deception Probes

Aansh Samyani, Zoe Tzifa-Kratira

Mentored by Avi Parrack

Working report from the SPAR program. May not reflect the authors' current views.

Abstract

We investigate the adversarial robustness of probes trained to detect deception in large language models. Deception probes, which classify model outputs as truthful or deceptive based on internal activations, are a promising approach to AI safety monitoring, yet their reliability under adversarial conditions remains poorly understood. We develop red-team protocols for both inference-time and training-time attacks on these probes, together with corresponding blue-team protocols that defend against the proposed attacks. Through systematic red-teaming and blue-teaming experiments on Llama-3.3-70B-Instruct, we demonstrate several vulnerabilities of deception probes and corresponding ways to mitigate them.