Fall 2025 Submitted January 2026

Exploring Reinforcement Learning Effects on Chain-of-Thought Legibility

Holden Mui, Vedant Badoni, Julian Huang, Baram Sosis

Mentored by Rohan Subramani

Working report from the SPAR program. May not reflect the authors' current views.

Abstract

Chain-of-thought (CoT) monitoring may serve as a core pillar of AI safety if further advances in AI capabilities do not significantly degrade the monitorability of LLM serial reasoning. We therefore studied the effects of several reinforcement learning (RL) training pressures (sampling temperature, KL-divergence penalties/rewards, and length budgets) on monitorability, focusing specifically on whether they induce illegible language. We trained DeepSeek R1-distills on math datasets. Several training setups, especially those using high sampling temperatures, resulted in high accuracy alongside strange reasoning traces containing nonsensical tokens. We noticed some of the same strange tokens being used across different responses by the same model and across different training runs. However, these traces usually contained a small amount of legible reasoning that solves the problem in addition to the strange text, suggesting that the models have not learned encoded reasoning. Some RL setups, such as introducing a length target and modifying the KL-divergence penalty, resulted in illegible or unusual model outputs that do not seem analogous to realistic monitorability failures. Broadly, our work makes progress toward creating a useful model organism of CoT illegibility and highlights various interesting phenomena related to illegibility that merit further study.