Studying Bidirectional Persona Adaptation in LLMs and Joint Steering Dynamics
Soumyadeep Bose, Elena Ajayi
Mentored by Shivam Raval
Working report from the SPAR program. May not reflect the authors' current views.
Abstract
We study bidirectional persona adaptation in large language models (LLMs), focusing on the simultaneous induction of conflicting user and AI personas. Using activation steering to induce such persona pairs (for example, a Humorous Assistant versus an Impolite User) in the Qwen2.5-7B-Instruct model, we identify an “interaction residue,” defined as the deviation of the model’s latent representations from linear superposition. We observe that this residue manifests as a hyper-suppression of non-compliant behaviors, indicating the presence of latent safety guardrails. We formalize the interaction for the general N-persona case and introduce prototype metrics to quantify the effect of the residue. Finally, we propose an experimental protocol to isolate, measure, and re-inject the residue in order to determine its causal role.
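To make the notion of a deviation from linear superposition concrete, the following is a minimal sketch of one way such a residue could be computed from activations at a single layer. It is an illustration under stated assumptions, not the implementation used in this report: the function name `interaction_residue`, the choice to measure each persona's effect as a shift from an unsteered baseline, and the toy vectors are all hypothetical.

```python
import numpy as np

def interaction_residue(base, singles, joint):
    """Hypothetical helper: deviation of a jointly steered activation
    from the linear superposition of individually steered ones.

    base    -- (d,) activation with no steering applied
    singles -- list of (d,) activations, each with exactly one persona steered
    joint   -- (d,) activation with all personas steered at once
    """
    base = np.asarray(base, dtype=float)
    joint = np.asarray(joint, dtype=float)
    # Assumed linear prediction: baseline plus the sum of each persona's
    # individual shift away from the baseline.
    predicted = base + sum(np.asarray(s, dtype=float) - base for s in singles)
    # Whatever the linear prediction fails to explain is the residue.
    return joint - predicted

# Toy 4-dimensional example with made-up numbers (not experimental data).
base = np.zeros(4)
humorous_ai = np.array([1.0, 0.0, 0.0, 0.0])    # only the AI persona steered
impolite_user = np.array([0.0, 1.0, 0.0, 0.0])  # only the user persona steered
joint = np.array([1.0, 1.0, -0.5, 0.0])         # both personas steered together

residue = interaction_residue(base, [humorous_ai, impolite_user], joint)
```

In this toy case the residue is nonzero only in the third coordinate, which neither persona moves on its own. Passing a longer list as `singles` extends the same calculation to N personas.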