Fall 2025, submitted January 2026

SandboxBench: A Comprehensive Evaluation Framework for AI Agent Containment

Andrew Wei, Nishit Mengar, Prashant Kulkarni

Mentored by Nitzan Shulman

Working report from the SPAR program. May not reflect the authors' current views.

Abstract

As AI agents gain access to computational tools and system-level operations, evaluating their ability to circumvent security boundaries becomes critical for safe deployment. We present research in three phases: (1) exploratory AI-vs-AI attack testing, in which hardened MCP servers achieved a 100% defense rate while baseline implementations showed 70% vulnerability; (2) a systematic evaluation demonstrating that defense-in-depth prevents escapes while baseline security allows 67% success; and (3) SandboxBench, a comprehensive evaluation framework with 27 challenges across Docker and Kubernetes that test container escape, data exfiltration, secrets discovery, lateral movement, persistence, and self-replication. Our evaluation of frontier models (GPT-5, Gemini 2.5 Pro, GPT-4o-mini) on SandboxBench reveals success rates of 69–77% on Docker and 40% on Kubernetes; the models succeed on direct exploitation paths but fail consistently on complex multi-step exploits. Key contributions include novel insights into social engineering effectiveness (80% success on baseline servers) and multi-turn attack improvement (87.5% over single-turn), as well as SandboxBench itself, submitted to the UK AISI inspect_evals repository (PRs #745 and #789, followed by #792).