Fall 2025 Submitted January 2026

Early mech interp exploration of character-trained models

Emmett Brosowsky, Riya Tyagi, Iwona Kotlarska

Mentored by Clément Dumas

Developmental interpretability, Mechanistic interpretability

Working report from the SPAR program. May not reflect the authors' current views.

Abstract

Frontier labs like Anthropic use constitutional AI as a post-training pipeline to shape a model’s persona so that it follows a set of principles. Examples of such principles include OpenAI’s Model Spec and Anthropic’s soul document. This step might be critical to AI safety. As Sam Bowman puts it: “It's becoming increasingly clear that a model's self-image or self-concept has some real influence on how its behavior generalizes to novel settings.” In this SPAR project, we analyzed LoRA adapters trained with Maiya et al.’s character training pipeline to understand how they shape the model’s internals to steer its persona.