Regularization-Guided Crosscoder for Cross-Layer Feature Detection
Maxim Panteleev, Maxim Finenko
Mentored by Yuxiao Li
Working report from the SPAR program. May not reflect the authors' current views.
Abstract
Sparse autoencoders (SAEs) are widely used in mechanistic interpretability to decompose the activations of a single layer into monosemantic features. Crosscoders extend this approach by learning sparse correspondences between features across layers. However, existing methods rely on naive loss functions and ignore structural constraints that reflect feature distribution and similarity. We propose modifications to the loss function of the original crosscoder models that incorporate empirical correlation structure. In addition to a VAE extension of the vanilla crosscoder loss function, our method leverages sentence/token similarity graphs and feature co-activation patterns to guide cross-layer alignment.