Automating interpretability with agents
Jack Payne, Inbal Meir, Julius Vidal
Mentored by Georg Lange
Working report from the SPAR program. May not reflect the authors' current views.
Abstract
Unsupervised sparse dictionary learning methods such as sparse autoencoders (SAEs) disentangle LLM representations into interpretable features, but they do not explain what those features represent. LLMs can be used to automatically generate and evaluate feature explanations, but these explanations often lack comprehensiveness or misinterpret complex patterns. We hypothesize that this stems from insufficient context: current automated methods fail to capture the nuance of the feature investigation that researchers typically conduct. In this work, we identify failure modes of existing methods and develop an AI agent that circumvents them. Specifically, we show that existing methods cannot reliably identify output-centric features (features that are best explained by how they shape model behavior) or determine the appropriate scope of an explanation (how narrowly to define the feature). We develop an agentic explainer that can iteratively run causal experiments to test and refine feature explanations, optimizing for explanation correctness and human interpretability. We demonstrate that our agentic explainer addresses several of the identified failure modes and produces more refined explanations. However, the agent’s explanations do not outperform existing methods, and we propose SAE quality and the scoring method as possible reasons. We further investigate whether agentic explanations are more succinct and parsimonious than the baseline.