Interpreting Mixture of Experts Routing through Sparse Autoencoders
Ilya Lasy, Nora Cai
Mentored by Kola Ayonrinde
Working report from the SPAR program. May not reflect the authors' current views.
Abstract
Sparse Mixture-of-Experts (MoE) transformers differ from dense transformers by routing each token through only a subset of the model's parameters, known as expert networks. However, which features determine these routing decisions remains unknown. We suggest that the difficulty of interpreting routing stems from the implicit assumption that each expert specializes in a single monosemantic domain. We relax this assumption and show that, when experts are modeled as specializing in a superposition of many domains, routing behavior becomes highly human-understandable. In particular, we introduce RouterInterp, a method that uses Sparse Autoencoder (SAE) features to produce concise natural-language explanations of MoE routing. RouterInterp achieves 81% recall in predicting expert routing on GPT-OSS-20B, compared with 55% for our best baseline, a bigram model that predicts routing from token co-occurrence patterns. By aggregating each expert’s most predictive SAE features, we obtain textual explanations that reveal how experts specialize. Our approach demonstrates that SAE features linearly predict expert selection and that their alignment with routing vectors yields faithful natural-language explanations of expert specialization. This transforms MoE routing from a black box into an interpretable component that can be monitored and optimized both during model development and in post-hoc analysis.
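To make the claim that SAE features linearly predict expert selection concrete, the following is a minimal sketch on synthetic data. It is not the RouterInterp implementation. It assumes a linear router, an SAE whose decoder rows live in the same residual space as the router's input, and a dot-product notion of alignment between decoder directions and router rows. All names (`feature_expert_alignment`, `top_features_per_expert`, `topk_experts`, `routing_recall`), the dimensions, the noise level, and the exact recall definition are hypothetical choices made for illustration.

```python
import numpy as np

def feature_expert_alignment(decoder, router):
    """Dot-product alignment between SAE decoder directions and router rows.

    decoder: (n_features, d_model), router: (n_experts, d_model).
    Returns an (n_features, n_experts) matrix.
    """
    return decoder @ router.T

def top_features_per_expert(alignment, k):
    """Indices of the k most-aligned features per expert, shape (n_experts, k)."""
    order = np.argsort(-alignment, axis=0)  # features sorted descending, per expert
    return order[:k].T

def topk_experts(logits, k):
    """Indices of the k largest router logits per token, shape (n_tokens, k)."""
    return np.argsort(-logits, axis=1)[:, :k]

def routing_recall(pred, true):
    """Fraction of truly selected experts that also appear in the predicted set."""
    hits = sum(len(set(p) & set(t)) for p, t in zip(pred, true))
    return hits / true.size

# Synthetic setup (all sizes are illustrative assumptions, not the paper's).
rng = np.random.default_rng(0)
n_tokens, n_features, n_experts, d_model, k = 200, 64, 8, 32, 2

decoder = rng.normal(size=(n_features, d_model))
decoder /= np.linalg.norm(decoder, axis=1, keepdims=True)  # unit-norm directions
router = rng.normal(size=(n_experts, d_model))

# Sparse, non-negative SAE activations: roughly 10% of features active per token.
mask = rng.random((n_tokens, n_features)) < 0.1
acts = rng.random((n_tokens, n_features)) * mask

# Residual stream = SAE reconstruction plus noise standing in for SAE error.
residual = acts @ decoder + 0.1 * rng.normal(size=(n_tokens, d_model))

# "True" routing comes from the residual; predicted routing uses only SAE features.
true_experts = topk_experts(residual @ router.T, k)
alignment = feature_expert_alignment(decoder, router)
pred_experts = topk_experts(acts @ alignment, k)

recall = routing_recall(pred_experts, true_experts)
top_feats = top_features_per_expert(alignment, 5)
print(f"top-{k} routing recall on synthetic data: {recall:.2f}")
```

In this sketch, `top_feats[e]` lists the features most aligned with expert `e`. Under the assumptions above, these would be the features whose descriptions one aggregates into a textual explanation for that expert.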