Spring 2025 Submitted May 2025

Regularization-Guided Crosscoder for Cross-Layer Feature Detection

Maxim Panteleev, Maxim Finenko

Mentored by Yuxiao Li

Working report from the SPAR program. May not reflect the authors' current views.

Abstract

Sparse autoencoders (SAEs) are widely used in mechanistic interpretability to decompose the activations of a single layer into monosemantic features. Crosscoders extend this approach by learning sparse correspondences between features across layers. However, existing methods rely on naive loss functions and ignore structural constraints that reflect how features are distributed and how similar they are to one another. We propose modifications to the loss function of the original crosscoder model that incorporate empirical correlation structure. In addition to a variational autoencoder (VAE) extension of the vanilla crosscoder loss, our method leverages sentence/token similarity graphs and feature co-activation patterns to guide cross-layer alignment.