Fall 2025. Submitted January 2026.

Discovering novel concepts in LLMs: Extraction of important, complex, yet learnable features

Simon Elias Schrader, Kamal Maher

Mentored by Kola Ayonrinde

Mechanistic interpretability, Philosophy of AI

Working report from the SPAR program. May not reflect the authors' current views.

Abstract

AI systems often outperform humans on complex tasks. This superior performance does not always arise from faster execution of human-like reasoning; it can also arise from qualitatively different internal algorithms and representations. Mechanistic interpretability can isolate these representations as features. Many features show clear patterns, while others appear uninterpretable. However, we lack a general method for identifying features that appear uninterpretable but prove interpretable upon closer inspection. Here, we present a method to isolate such features in large language models (LLMs) using sparse autoencoders (SAEs). We filter for SAE features that are both important to model behavior, as measured by direct logit attribution, and complex, as indicated by low automated interpretability (AutoInterp) scores. We then use LLM agents to generate improved explanations of feature behavior, and we evaluate feature learnability as the increase in detection accuracy when agents iteratively refine these explanations. We apply our method to Gemma 2 2B and find an increase of 6 percentage points in detection accuracy, with many complex features corresponding to abstract grammatical concepts. Our results suggest that some complex features may be learnable by humans, which would enable the transfer of novel concepts from models to people, with relevance for scientific discovery and AI safety.
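The selection and evaluation logic summarized in the abstract can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and not taken from the report: the `FeatureStats` record, the two threshold parameters, the way importance is reduced to a single number, and the toy values (which are made up and are not results). The sketch only shows the shape of the idea: keep features with high direct logit attribution and low AutoInterp score, then measure learnability as the gain in detection accuracy after explanation refinement.

```python
from dataclasses import dataclass


@dataclass
class FeatureStats:
    """Hypothetical per-feature summary; field names are illustrative."""
    feature_id: int
    dla_importance: float     # assumed scalar summary of direct logit attribution
    autointerp_score: float   # assumed AutoInterp score in [0, 1]
    baseline_accuracy: float  # detection accuracy with the initial explanation
    refined_accuracy: float   # detection accuracy after agent refinement


def select_candidates(features, min_importance, max_autointerp):
    """Keep features that are important (high DLA) yet complex (low AutoInterp)."""
    return [
        f for f in features
        if f.dla_importance >= min_importance
        and f.autointerp_score <= max_autointerp
    ]


def learnability_gain(feature):
    """Increase in detection accuracy, expressed in percentage points."""
    return 100.0 * (feature.refined_accuracy - feature.baseline_accuracy)


# Toy values chosen only to exercise the code; they are not data from the report.
features = [
    FeatureStats(0, 0.90, 0.20, 0.50, 0.75),  # important and complex: kept
    FeatureStats(1, 0.80, 0.85, 0.90, 0.90),  # important but already well explained
    FeatureStats(2, 0.05, 0.10, 0.50, 0.50),  # complex but unimportant
]

# Thresholds are hypothetical placeholders, not the authors' settings.
candidates = select_candidates(features, min_importance=0.5, max_autointerp=0.4)
gains = {f.feature_id: learnability_gain(f) for f in candidates}
```

In this sketch a feature counts as learnable to the extent that its gain is positive. How gains are aggregated across features, and which thresholds are appropriate, are left open here.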