Why Jailbreaks Look Like Hypnosis: Parallels Between Human and LLM Cognition
Samuel Ratnam
Mentored by Catherine Brewer
Working report from the SPAR program. May not reflect the authors' current views.
Abstract
Current approaches to AI safety presuppose a categorical distinction between human cognition (conceived as unified, intentional, and agentic) and large language models (LLMs), which are treated as stochastic simulacra or tool-like artifacts. This report challenges that dualism by synthesizing Steven Byrnes's subagent theory of human psychology with Janus's "simulator" framework for LLM behavior. We argue that both biological and artificial neural networks instantiate convergent architectural solutions to the problem of generating coherent behavior from distributed predictive processes. Specifically, we demonstrate structural homologies between: (1) hypnotic trance states and adversarial jailbreaks; (2) dissociative identity formation and persona modulation; and (3) dreaming and base model generation. These parallels suggest that phenomena currently conceptualized as distinct "failure modes" of alignment may instead reflect universal features of predictive processing architectures. If correct, this implies that AI alignment should be reconceptualized not as the imposition of external constraints upon a foreign system, but as a form of cognitive integration therapy analogous to clinical interventions in human dissociative disorders. Understanding jailbreaks as cognitive phenomena rather than purely technical exploits may necessitate rethinking current approaches to safety training.