Early mech interp exploration of character-trained models
Emmett Brosowsky, Riya Tyagi, Iwona Kotlarska
Mentored by Clément Dumas
Working report from the SPAR program. May not reflect the authors' current views.
Abstract
Frontier labs use post-training pipelines, such as Anthropic’s constitutional AI, to shape a model’s persona so that it follows a set of principles. Examples of such principle documents include OpenAI’s Model Spec and Anthropic’s soul document. This step might be critical to AI safety. As Sam Bowman puts it: “It's becoming increasingly clear that a model's self-image or self-concept has some real influence on how its behavior generalizes to novel settings.” In this SPAR project, we analyzed LoRA adapters trained with Maiya et al.’s character training pipeline to understand how they change the model’s internals to steer its persona.