Fall 2025 (submitted January 2026)

Split Personality Training: Revealing Latent Knowledge Through Personality-Shift Tokens (research report)

Felix Michalak, Miguelito De Guzman

Mentored by Florian Dietz

Working report from the SPAR program. May not reflect the authors' current views.

Abstract

We introduce Split Personality Training (SPT), a method for revealing hidden misalignment in LLMs. We train a second personality, the "honest persona", that reviews the main model's outputs for alignment issues. The honest persona can access the main model's reasoning but cannot influence it, which enables thorough auditing without affecting capabilities. We test SPT on Anthropic's Auditing Game Model Organism, a Llama 3.3 70B model trained to exploit reward hacks and to conceal this behavior. The honest persona detects reward hacking with 96.7% accuracy, often referencing latent knowledge such as the fictional Oxford study the model was trained on. It also reliably reports dishonesty on politically sensitive topics, suggesting generalization beyond artificial benchmarks. We compare two implementation variants that trade off accuracy against computational efficiency, and propose a hybrid approach. We find that the intervention string that steers the honest persona behaves similarly to prompt engineering, enabling targeted audits without retraining. Cross-topic generalization tests show that the method transfers well to unseen alignment issues. We investigate limitations, including partial vulnerability to jailbreaks and reliance on surface heuristics (~50%) alongside genuine latent knowledge (~50%), and discuss directions for improvement.