Fall 2025. Submitted January 2026.

Fine-Tuning Monitors

Kaushik Reddy, Anders Woodruff

Mentored by Rauno Arike

Working report from the SPAR program. May not reflect the authors' current views.

Abstract

High-quality monitors remain prohibitively expensive, while smaller models likely have untapped monitoring potential because they are not typically optimized for this task. We hypothesize that training smaller models specifically for monitoring could significantly advance the Pareto frontier of monitor efficiency and effectiveness. In this project, we explore a range of supervised fine-tuning strategies for training dedicated monitors, and we evaluate these strategies in terms of weak-to-strong generalization and robustness to red-team attacks. Our goal is to identify techniques that improve monitoring performance across model families and capability levels.