Discovering novel concepts in LLMs: Extraction of important, complex, yet learnable features
Simon Elias Schrader, Kamal Maher
Mentored by Kola Ayonrinde
Working report from the SPAR program. May not reflect the authors' current views.
Abstract
AI systems often outperform humans on complex tasks. This advantage does not always come from executing human-like reasoning faster; it can also come from qualitatively different internal algorithms and representations. Mechanistic interpretability can isolate these representations as features, many of which show clear patterns, while others appear uninterpretable. However, we lack a general method to identify features that appear uninterpretable but prove interpretable upon closer inspection. Here, we present a method to isolate such features in large language models (LLMs) using sparse autoencoders (SAEs). We filter for SAE features that are important to model behavior, as measured by direct logit attribution, and complex, as indicated by low automated interpretability (AutoInterp) scores. We then use LLM agents to generate improved explanations of feature behavior, and we evaluate feature learnability by the increase in detection accuracy as the agents iteratively refine their explanations. Applying our method to Gemma 2 2B, we find an increase of 6 percentage points in detection accuracy, with many complex features corresponding to abstract grammatical concepts. Our results suggest that some complex features may be learnable by humans, which would enable the transfer of novel concepts from models to people, with relevance for scientific discovery and AI safety.
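The selection and scoring steps summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Feature` record, its field names, the threshold values, and the example numbers are all hypothetical, and the sketch assumes that importance and AutoInterp scores have already been computed for each SAE feature.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Feature:
    """Hypothetical per-feature record; all field names are illustrative."""
    feature_id: int
    dla_importance: float            # assumed direct logit attribution score
    autointerp_score: float          # assumed AutoInterp score in [0, 1]
    detection_accuracy: List[float]  # accuracy after each refinement round


def select_candidates(features: List[Feature],
                      min_importance: float,
                      max_autointerp: float) -> List[Feature]:
    """Keep features that are important (high DLA) yet complex (low AutoInterp)."""
    return [
        f for f in features
        if f.dla_importance >= min_importance
        and f.autointerp_score <= max_autointerp
    ]


def learnability_gain(feature: Feature) -> float:
    """Change in detection accuracy from the first to the last explanation."""
    return feature.detection_accuracy[-1] - feature.detection_accuracy[0]


# Made-up example data; thresholds are arbitrary placeholders.
features = [
    Feature(0, dla_importance=0.9, autointerp_score=0.30,
            detection_accuracy=[0.62, 0.65, 0.68]),
    Feature(1, dla_importance=0.8, autointerp_score=0.95,
            detection_accuracy=[0.97, 0.97]),
    Feature(2, dla_importance=0.1, autointerp_score=0.20,
            detection_accuracy=[0.55, 0.56]),
]
candidates = select_candidates(features, min_importance=0.5, max_autointerp=0.5)
gains = {f.feature_id: learnability_gain(f) for f in candidates}
```

Under these assumptions, a feature counts as learnable when its gain is positive, meaning the refined explanation lets a detector predict the feature's activations more accurately than the initial explanation did.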