Robust Jailbreak Detection via Training against Embedding Attacks
Julian Bitterwolf
Mentored by Jordan Taylor
Working report from the SPAR program. May not reflect the authors' current views.
Abstract
Although great effort goes into guardrailing language models against producing harmful, unsafe, or otherwise unwanted outputs, these safeguards can often be circumvented by specially crafted jailbreak inputs. Detector LLMs such as Prompt-Guard make a binary decision on whether an input is irregular, and thus a potential jailbreak, and reject such inputs rather than passing them to an agent model (e.g. a text generator). We demonstrate that replacing a single input token with an adversarially optimized soft-token (an embedding-space vector not restricted to the vocabulary) can bypass Prompt-Guard. To address this, we propose Asymmetric Adversarial Detection Training (AADT), a method that trains detectors against embedding-space attacks on irregular samples only, while preserving standard detection accuracy. AADT takes advantage of the fact that adversarial soft-tokens are much cheaper to compute than adversarial hard-tokens. Our goal is to use this cheaper soft-token adversarial training to obtain a model that is robust against hard-token attacks. While hard-token threat models can be subsets of soft-token threat models (every vocabulary token corresponds to a point in embedding space), we hope for generalization to hard-token threat models that are more permissive in other parameters. In particular, we explore whether robustness against single-token soft-token manipulations generalizes to hard-token attacks that can alter many input tokens. AADT results in detection that is fully robust against the type of soft-token attack it is trained with, and it quickly reduces vulnerability to hard-token attacks. This project is still a work in progress, particularly with regard to certain evaluations. Our goal is to advance the development of attack-resistant jailbreak detectors and to provide an angle of evaluation for malicious input detectors. Code is available at https://github.com/j-cb/adv_robust_mad.
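To make the two ideas in the abstract concrete, the following is a minimal toy sketch of a single soft-token attack and of one asymmetric training step. It is not the implementation from the repository and does not involve Prompt-Guard: the detector here is assumed to be a linear classifier over mean-pooled token embeddings, and all names (`soft_token_attack`, `aadt_step`, the step counts and learning rates) are hypothetical choices made for illustration only.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mean_pool(embeds):
    # Average the token embeddings coordinate-wise.
    n = len(embeds)
    return [sum(col) / n for col in zip(*embeds)]

def logit(w, b, embeds):
    # Toy detector (assumption): linear score on the mean-pooled input.
    pooled = mean_pool(embeds)
    return sum(wi * pi for wi, pi in zip(w, pooled)) + b

def soft_token_attack(w, embeds, pos, steps=20, lr=0.5):
    # Treat the embedding at `pos` as a free vector and run gradient
    # descent on the detector logit. All other positions stay fixed.
    adv = [list(e) for e in embeds]
    n = len(adv)
    for _ in range(steps):
        # For this linear toy detector, d(logit)/d(adv[pos]) = w / n.
        adv[pos] = [x - lr * wi / n for x, wi in zip(adv[pos], w)]
    return adv

def aadt_step(w, b, embeds, label, pos=0, lr=0.1):
    # Asymmetric step: only irregular samples (label 1) are attacked
    # before the update; regular samples (label 0) are used unmodified.
    x = soft_token_attack(w, embeds, pos) if label == 1 else embeds
    pooled = mean_pool(x)
    err = sigmoid(logit(w, b, x)) - label  # gradient of BCE w.r.t. logit
    new_w = [wi - lr * err * pi for wi, pi in zip(w, pooled)]
    new_b = b - lr * err
    return new_w, new_b

# Hypothetical detector weights and a 4-token "irregular" input.
w, b = [2.0, -1.0], 0.0
jailbreak = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]

clean_score = sigmoid(logit(w, b, jailbreak))
adv = soft_token_attack(w, jailbreak, pos=2)
adv_score = sigmoid(logit(w, b, adv))

# One asymmetric update on the irregular sample.
w2, b2 = aadt_step(w, b, jailbreak, label=1)
```

In this toy setting, changing one embedding is enough to push the detector score from above 0.5 to below it, and a single asymmetric update raises the logit the detector assigns to the attacked input. A real detector is nonlinear, so the attack gradient would come from backpropagation through the model rather than from the closed form used here.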