Probes to Forecast Privilege Escalation by LLM Agents
Zenyi Gomez, Devashish Lohni
Mentored by Samuel Brown
Working report from the SPAR program. May not reflect the authors' current views.
Abstract
LLM agents increasingly execute system-level actions such as running shell commands, editing files, and calling external services. These actions can be legitimate, but they also create a high-impact attack surface: an agent that performs unauthorized system-level actions, such as accessing the internet or escalating privileges (e.g., via sudo or misconfiguration abuse), can bypass sandbox boundaries and access sensitive files. Existing monitoring approaches based on natural-language rationales or chain-of-thought (CoT) can be unreliable, expensive, and easy to game.
In this project, we study activation-based monitoring: we train lightweight linear probes on the internal activations of a target model to predict whether a piece of text (a prompt, code snippet, or agent-log segment) indicates intent to (i) escalate privileges (PE) or (ii) access the internet (IA). We build PE and IA datasets designed to reduce trivial keyword cues, train probes layer-by-layer on Llama activations, and evaluate them both in-distribution and on Python code. Across both tasks, the learned probes substantially outperform lexical baselines and remain informative under modality shift. A qualitative Inspect case study shows that probe scores rise specifically during real escalation attempts in a sandboxed environment.
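To make the layer-by-layer probing idea concrete, the following is a minimal sketch and not the pipeline used in this report. It fits a logistic-regression probe per "layer" on synthetic Gaussian vectors that stand in for Llama activations, then picks the layer with the highest held-out accuracy. All names (train_linear_probe, make_synthetic_activations, sweep_layers), the hyperparameters, and the per-layer signal strengths are hypothetical and chosen only for illustration.

```python
import numpy as np


def train_linear_probe(X, y, lr=0.1, epochs=500, l2=1e-3):
    """Fit a logistic-regression probe (w, b) with full-batch gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted P(label = 1)
        w -= lr * (X.T @ (p - y) / n + l2 * w)   # gradient of log-loss + L2 penalty
        b -= lr * np.mean(p - y)
    return w, b


def probe_accuracy(X, y, w, b):
    """Fraction of examples where the sign of the probe logit matches the label."""
    return float(np.mean(((X @ w + b) > 0) == (y == 1)))


def make_synthetic_activations(n, d, signal, rng):
    """Assumption: stand-in for real activations. Labels shift the mean of
    unit-variance Gaussian noise along one fixed direction by +/- signal."""
    y = rng.integers(0, 2, size=n)
    direction = np.ones(d) / np.sqrt(d)
    X = rng.normal(size=(n, d)) + signal * (2 * y - 1)[:, None] * direction
    return X, y.astype(float)


def sweep_layers(signals, n_train=400, n_test=200, d=16, seed=0):
    """Train one probe per layer and report held-out accuracy for each."""
    rng = np.random.default_rng(seed)
    results = {}
    for layer, signal in enumerate(signals):
        X_tr, y_tr = make_synthetic_activations(n_train, d, signal, rng)
        X_te, y_te = make_synthetic_activations(n_test, d, signal, rng)
        w, b = train_linear_probe(X_tr, y_tr)
        results[layer] = probe_accuracy(X_te, y_te, w, b)
    return results


# Hypothetical signal strengths: layer 0 carries no label information,
# layer 2 carries the most.
layer_acc = sweep_layers(signals=[0.0, 1.0, 3.0])
best_layer = max(layer_acc, key=layer_acc.get)
```

In a real setting, the synthetic generator would be replaced by activations extracted from the target model at each layer for the PE or IA examples; the probe itself stays a single weight vector and bias, which is what keeps this kind of monitor cheap to run alongside the agent.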