Fall 2025 cohort. Submitted January 2026.

Evaluating the Universality of Linear Deception Representations

Aansh Samyani, Zoe Tzifa-Kratira, Tasha Pais, Erik Nordby

Mentored by Avi Parrack

Working report from the SPAR program. May not reflect the authors' current views.

Abstract

Linear probes trained on internal activations have shown promise for detecting deceptive behavior in large language models, but the extent to which such signals are universal across model families, scales, and deception contexts remains an open question. We conduct a layer-wise evaluation of activation-based deception probes across instruction-tuned language models from the Gemma, Llama, and Qwen families, spanning 0.5B to 72B parameters. We analyze how probe performance scales with model size, where deception-related signals appear most linearly accessible within networks, and how well probes generalize across different forms of deception. Our results suggest that deception-related representations tend to become increasingly linearly accessible with scale, though they do not appear to be localized to consistent layers across architectures or tasks.
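To make the layer-wise probing setup concrete, the following is a minimal sketch and not the authors' pipeline. It assumes activations have already been extracted into an array of shape (layers, examples, hidden size) with binary honest/deceptive labels, fits one logistic-regression probe per layer, and reports held-out AUROC. The synthetic data, the signal strengths, the hyperparameters, and the helper names (`auroc`, `fit_linear_probe`, `layerwise_auroc`) are all illustrative assumptions.

```python
import numpy as np

def auroc(scores, labels):
    """Rank-based AUROC (Mann-Whitney U); assumes no tied scores."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def fit_linear_probe(X, y, lr=0.1, steps=500, l2=1e-2):
    """Logistic regression by plain gradient descent; returns a scoring function."""
    mu, sd = X.mean(axis=0), X.std(axis=0) + 1e-8
    Xs = (X - mu) / sd                      # standardize with training statistics
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(Xs @ w + b)))
        g = p - y                           # gradient of the log-loss w.r.t. logits
        w -= lr * (Xs.T @ g / len(y) + l2 * w)
        b -= lr * g.mean()
    return lambda Z: ((Z - mu) / sd) @ w + b

def layerwise_auroc(acts, labels, train_frac=0.7, seed=0):
    """acts: (n_layers, n_examples, d_model). One probe per layer, same split."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(acts.shape[1])
    n_train = int(train_frac * len(idx))
    tr, te = idx[:n_train], idx[n_train:]
    return [auroc(fit_linear_probe(layer[tr], labels[tr])(layer[te]), labels[te])
            for layer in acts]

# Synthetic stand-in for real activations (assumption): a single "deception"
# direction whose strength differs by layer. The strengths are made up.
rng = np.random.default_rng(0)
n_examples, d_model = 400, 16
labels = rng.integers(0, 2, size=n_examples)
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)
strengths = [0.0, 0.5, 2.0, 4.0]
acts = np.stack([
    rng.normal(size=(n_examples, d_model)) + s * np.outer(2 * labels - 1, direction)
    for s in strengths
])

scores = layerwise_auroc(acts, labels)
for layer, score in enumerate(scores):
    print(f"layer {layer}: held-out AUROC = {score:.3f}")
```

With real models, `acts` would be replaced by hidden states collected at each layer. Comparing the resulting per-layer curves across model families and sizes is the kind of analysis the abstract describes.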