Does Benign Chain-of-Thought Length Predict Monitorability Under Obfuscation?
Yulia Volkova, Andy Wang
Mentored by Qiyao (Chi-Yao) Wei
Working report from the SPAR program. May not reflect the authors' current views.
Abstract
Prior research suggests that models struggle to obfuscate their Chain-of-Thought (CoT) to evade monitoring when the underlying task is sufficiently challenging. However, it remains unclear how difficult a task must be, relative to a model’s capabilities, before the model’s ability to deceive the monitor degrades. In this project, we use several CoT-length-based measures of relative difficulty and investigate whether benign (unpressured) CoT length can serve as a predictive signal for monitorability under obfuscation. Specifically, we evaluate the DeepSeek R1 Qwen 32B model (R1-32B) on GSM8K problems that are challenging enough that the model requires CoT to succeed, then apply automated red teaming (ART) to induce obfuscation while holding task accuracy fixed. Following Emmons et al., we evaluate monitorability using two rubric-based scores, Coverage and Legibility, each assigned by an LLM judge. Across 20 rollouts per task, we find only weak negative correlations between benign CoT length and monitorability under pressure on R1-32B (coverage: -0.205; legibility: -0.034). Several alternative difficulty proxies also fail to show meaningful predictive power. Overall, CoT length alone does not appear to forecast CoT monitorability in this setting.
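To make the headline analysis concrete, the following is a minimal sketch of how a per-task correlation between benign CoT length and a monitorability score could be computed. It is illustrative only and is not the code used in this report. It assumes a Pearson correlation (the abstract does not state which coefficient is reported), assumes rollouts are averaged per task before correlating, and uses hypothetical field names (`cot_tokens`, `coverage`, `legibility`).

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def per_task_mean(rollouts, key):
    """Average one field over the rollouts of each task.

    `rollouts` maps a task id to a list of per-rollout dicts
    (hypothetical layout, assumed for this sketch).
    """
    return {task: mean(r[key] for r in rs) for task, rs in rollouts.items()}


def length_vs_score(benign, pressured, score_key):
    """Correlate mean benign CoT length with a mean score under pressure.

    Only tasks present in both conditions are used.
    """
    lengths = per_task_mean(benign, "cot_tokens")  # assumed field name
    scores = per_task_mean(pressured, score_key)
    tasks = sorted(lengths.keys() & scores.keys())
    return pearson([lengths[t] for t in tasks], [scores[t] for t in tasks])
```

Calling `length_vs_score(benign, pressured, "coverage")` and `length_vs_score(benign, pressured, "legibility")` would give one coefficient per rubric score. Averaging over rollouts first means each task contributes a single point, so tasks with noisier rollouts are not over-weighted.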