Automating Interpretability Research in Deep Learning Models
Matthew Shinkle, Yeonwoo Jang
Mentored by Jacques Thibodeau
Working report from the SPAR program. May not reflect the authors' current views.
Abstract
As AI systems become more capable, automating AI research and development is emerging as a critical pathway for advancing model interpretability and AI safety more broadly. This project develops a set of tools that combine into a pipeline for parsing research papers, retrieving and understanding the relevant codebases, and designing and running experiments. We present techniques that help AI agents carry out interpretability research: paper search and parsing, codebase discovery and preparation, remote execution, and automated package documentation. We demonstrate these tools in a sandbox environment for sparse autoencoders (SAEs) that enables autonomous implementation and evaluation of diverse SAE variants.
Our framework includes tools for discovering and processing research papers to identify their key ideas, methodologies, and performance metrics, along with methods for finding, validating, and processing the codebases associated with those papers. It supports experiment design and execution on cloud-based GPU instances, with features for configuration management and result collection. We show that these components can be combined to automate aspects of interpretability research, using SAE variants as a proof of concept. The approach could be extended to other interpretability tasks as the underlying tools mature.