Spring 2025, submitted May 2025

Base vs Instruct Models with Emergent Misalignment

Nikita Menon

Mentored by Walter Laurito

Mechanistic interpretability, Scalable oversight

Working report from the SPAR program. May not reflect the authors' current views.

Abstract

This project investigates whether instruction tuning increases the susceptibility of language models to emergent misalignment, that is, the tendency to adopt and generalize misaligned behavior such as deception and toxicity after narrow fine-tuning. We compare the instruction-tuned and base variants of the same model (Mistral-Small-24b-2501), fine-tuning both on the same misaligned data, such as insecure code and deceptive factual QA pairs, and then evaluating their behavior on unrelated downstream prompts. We observe that base models show higher levels of misalignment than their instruct counterparts, but that instruct models fine-tuned on deceptive factual datasets may become more deceptive than the base models. We also made preliminary attempts to test whether existing truthfulness-probing methods can reliably detect deception in these misaligned model variants, and whether a probe trained to detect truthfulness in the non-fine-tuned model transfers well to its misaligned counterpart.