In December 2025, a security researcher disclosed 30+ vulnerabilities across the most popular AI-powered IDEs — Cursor, Windsurf, GitHub Copilot, Cline, and more. The root cause was nearly identical across all of them: agents were executing code with the permissions of the local user, inside shared process environments, with no meaningful isolation boundary between the AI-generated code and everything else on the machine.
The researcher named the disclosure “IDEsaster.” The name stuck.
None of these incidents required breaking cryptography or exploiting memory corruption bugs. They required something much simpler: AI agents that trusted their own output enough to run it without isolation.
The core problem
When a DevOps team deploys a Docker container running their own application, they largely know what is in it. When an AI agent generates code — in response to a user prompt, a CI trigger, or an incoming webhook — neither the developer nor the agent can fully verify what the code will do at runtime.
The numbers are uncomfortable:
- AI-generated code contains 2.74x more security vulnerabilities than human-written code
- Only 55% of AI code generation tasks produce secure code across tested models (Veracode 2025)
- AI-generated code adds over 10,000 new security findings per month across studied repositories — a 10x increase from late 2024
- 83% of companies plan to deploy AI agents in 2026
Docker containers use Linux namespaces and cgroups to isolate processes while sharing the host kernel. This is a security boundary against accidental interference, not a security boundary against adversarial code. A container breakout via a kernel vulnerability is a single step away from host access and lateral movement to every other container on the host.
The three isolation tiers
The industry has converged on three distinct approaches to sandboxing AI agent code execution, each with different security guarantees and performance tradeoffs.
Tier 1: Docker containers (insufficient for untrusted code)
Containers start in ~50ms, have minimal performance overhead, and are the default deployment unit for most engineering teams. They are not enough, however, when the code running inside them was generated by an LLM.
The specific failure mode: Docker-in-Docker for agent tool execution typically requires --privileged mode, which grants the container every Linux capability and access to host devices. At that point, the isolation guarantees are effectively gone.
Use containers when: You control the code. Single-tenant. Trusted workloads only.
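For the trusted workloads where a container is still appropriate, the opposite of --privileged can be sketched as a command-line builder. This is a minimal illustration, not a complete hardening guide: the function names, the image, and the resource limits are assumptions chosen for the example, while the flags themselves are standard `docker run` options. Even with all of them set, the container still shares the host kernel, so this does not make untrusted code safe.

```python
import shlex


def hardened_docker_argv(image: str, command: list[str]) -> list[str]:
    """Build a `docker run` command line with common hardening flags."""
    return [
        "docker", "run",
        "--rm",                                 # remove the container when it exits
        "--network", "none",                    # no network access
        "--read-only",                          # read-only root filesystem
        "--cap-drop", "ALL",                    # drop every Linux capability
        "--security-opt", "no-new-privileges",  # block privilege escalation via setuid
        "--pids-limit", "128",                  # cap the process count (assumed value)
        "--memory", "256m",                     # cap memory via cgroups (assumed value)
        "--user", "65534:65534",                # run as an unprivileged UID:GID
        image,
        *command,
    ]


def reject_privileged(argv: list[str]) -> None:
    """Raise if a command line asks for --privileged."""
    if "--privileged" in argv:
        raise ValueError("--privileged removes most container isolation")


argv = hardened_docker_argv("python:3.12-slim", ["python", "-c", "print('hi')"])
reject_privileged(argv)
print(shlex.join(argv))
```

The builder only assembles arguments, so it can be reviewed and tested without a Docker daemon; passing the list to `subprocess.run` would be the next step.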
Tier 2: gVisor (the middle ground)
Google’s gVisor implements a user-space kernel that intercepts system calls before they reach the host kernel. When a container makes a syscall, gVisor’s Sentry process handles it in user space, drastically reducing kernel attack surface. Instead of hundreds of syscalls reaching the host kernel, gVisor allows only a minimal, vetted subset.
- Cold start: ~100ms
- Performance overhead: 20-50% on I/O-heavy workloads
- Security model: Syscall-level isolation. Stronger than containers, weaker than VMs.
- Used by: Modal, Google (Agent Sandbox on GKE)
The tradeoff: not all syscalls are perfectly emulated, which can cause compatibility issues with some Linux software.
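With Docker, opting a container into gVisor is a runtime selection rather than a code change. The sketch below assumes gVisor's `runsc` runtime is already installed and registered with the Docker daemon under that name; the helper `with_runtime` is hypothetical and only rewrites an argument list.

```python
def with_runtime(argv: list[str], runtime: str = "runsc") -> list[str]:
    """Return a copy of a `docker run` argv that selects an OCI runtime."""
    if argv[:2] != ["docker", "run"]:
        raise ValueError("expected a 'docker run' command line")
    # Insert the runtime flag directly after `docker run`, before other options.
    return [*argv[:2], f"--runtime={runtime}", *argv[2:]]


base = ["docker", "run", "--rm", "python:3.12-slim", "python", "-c", "print(1)"]
print(with_runtime(base))
```

Because the workload image is unchanged, the same command can be run with and without the flag to check whether the software hits one of the syscall compatibility gaps.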
Tier 3: MicroVMs (the gold standard)
Firecracker, developed by AWS for Lambda and Fargate, creates lightweight virtual machines with minimal device emulation. Each microVM runs its own Linux kernel as a KVM guest, so isolation is enforced by hardware virtualization: a compromised guest process would have to escape both the guest kernel and the hypervisor before it could reach the host kernel or pivot laterally to other sandboxes.
- Cold start: ~125-150ms
- Performance overhead: Low (native inside VM)
- Memory overhead: Less than 5 MiB per VM
- Creation rate: Up to 150 microVMs per second per host
- Density: ~4,000 microVMs per host
- Security model: Hardware virtualization. Attacker must escape guest kernel AND hypervisor.
- Used by: AWS Lambda, AWS Fargate, E2B
Kata Containers brings the same VM-backed isolation to Kubernetes-native orchestration. From Kubernetes’ perspective, the workload is a normal container. Under the hood, it runs inside a full VM with hardware isolation.
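To make "minimal device emulation" concrete, here is a rough sketch of the JSON configuration a Firecracker microVM can be started from: a kernel, one root drive, and a CPU and memory size. The field names follow the Firecracker getting-started material as I recall it and should be checked against the version you run; the file paths, the helper name, and the default sizes are assumptions for illustration.

```python
import json


def firecracker_config(kernel: str, rootfs: str,
                       vcpus: int = 1, mem_mib: int = 128) -> dict:
    """Describe a single-drive microVM as a Firecracker-style config dict."""
    return {
        "boot-source": {
            "kernel_image_path": kernel,
            "boot_args": "console=ttyS0 reboot=k panic=1 pci=off",
        },
        "drives": [{
            "drive_id": "rootfs",
            "path_on_host": rootfs,
            "is_root_device": True,
            "is_read_only": True,  # assumed: a throwaway sandbox needs no writes to the image
        }],
        "machine-config": {"vcpu_count": vcpus, "mem_size_mib": mem_mib},
    }


# Hypothetical paths; substitute your own kernel and root filesystem image.
config = firecracker_config("/srv/fc/vmlinux", "/srv/fc/rootfs.ext4")
print(json.dumps(config, indent=2))
```

Written to a file, a config like this is typically passed to the `firecracker` binary with `--config-file` alongside `--api-sock`. Note how little is declared: no display, no USB, no PCI bus, which is where the small attack surface comes from.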
WebAssembly (emerging, limited)
WebAssembly provides language-level sandboxing with sub-10ms cold starts, but by default a module has no filesystem, no network, and no persistent state unless the host explicitly grants them. It works for browser-based and edge compute, but is not suitable for most AI agent workloads that need OS integration.
The comparison
| Technology | Cold Start | Isolation Strength | Performance Overhead | Best For |
|---|---|---|---|---|
| Docker | ~50ms | Weak (shared kernel) | Minimal | Trusted application code |
| gVisor | ~100ms | Medium (syscall interception) | 20-50% | Controlled AI workloads |
| Firecracker | ~125ms | Strong (HW virtualization) | Low (native in VM) | Untrusted agent code |
| Kata | ~200ms | Strong (HW virtualization) | Low | Multi-tenant Kubernetes |
| WebAssembly | <10ms | Strong (language sandbox) | High | Edge/stateless functions |
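The table can also be treated as data and filtered by requirement. The sketch below encodes only the cold-start and isolation columns, using the approximate figures from the table (with "<10ms" recorded as an upper bound of 10). The `Option` type and `candidates` function are hypothetical, and the filter ignores everything else that matters in practice, such as WebAssembly's limited OS integration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str
    cold_start_ms: int  # approximate figure from the comparison table
    isolation: str      # "weak", "medium", or "strong"


OPTIONS = [
    Option("Docker", 50, "weak"),
    Option("gVisor", 100, "medium"),
    Option("Firecracker", 125, "strong"),
    Option("Kata", 200, "strong"),
    Option("WebAssembly", 10, "strong"),
]

RANK = {"weak": 0, "medium": 1, "strong": 2}


def candidates(min_isolation: str, max_cold_start_ms: int) -> list[str]:
    """Names of options meeting an isolation floor and a cold-start budget."""
    floor = RANK[min_isolation]
    return [
        o.name for o in OPTIONS
        if RANK[o.isolation] >= floor and o.cold_start_ms <= max_cold_start_ms
    ]


print(candidates("strong", 150))  # → ['Firecracker', 'WebAssembly']
```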
The sandbox platform race
A new infrastructure category has emerged: sandbox-as-a-service platforms that abstract the isolation technology behind SDK calls.
E2B has become the reference implementation. Every sandbox runs in a Firecracker microVM. Python and TypeScript SDKs make provisioning a sandbox a one-liner. Sessions persist for up to 24 hours. The open-source core has driven adoption across major agent frameworks.
Modal takes a different approach with gVisor-based isolation and native GPU access. The right choice when your agent needs to run inference, not just scripts.
Daytona provides full development environments, not just execution sandboxes — persistent workspaces with IDE integration for AI coding agents.
Cloudflare Workers puts V8 isolates at the edge for sub-10ms cold starts and global distribution, but with no persistent filesystem.
Docker itself has entered the race with a hardened container approach, acknowledging that standard containers are insufficient for agent workloads.
What sandboxes provide that VMs and containers do not
The key insight is that sandboxes are ephemeral by design. A VM is long-lived infrastructure you provision and maintain. A container is a running service. A sandbox is created for a single task and destroyed after.
This changes the security calculus:
- No state leakage between tasks: each code execution gets a fresh environment
- SDK-native provisioning: sandbox = Sandbox() is one line of code, not infrastructure engineering
- Automatic cleanup: the sandbox is destroyed after the task, leaving no zombie containers
- Per-task kernel isolation: a compromised sandbox cannot reach the host or other sandboxes
- Minimal attack surface: Firecracker is roughly 50,000 lines of Rust, not the entire Linux kernel
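The create, run, destroy lifecycle behind the first and third points can be sketched with the standard library alone. This is an illustration of ephemerality only and provides no isolation: the snippet runs as the local user, which is exactly the situation the article warns against. In a real deployment the `subprocess.run` call would be replaced by a call into a microVM-backed sandbox. The names `ephemeral_workspace` and `run_task` are hypothetical.

```python
import contextlib
import pathlib
import subprocess
import sys
import tempfile


@contextlib.contextmanager
def ephemeral_workspace():
    """Yield a fresh scratch directory and delete it when the task ends."""
    with tempfile.TemporaryDirectory(prefix="task-") as path:
        yield pathlib.Path(path)


def run_task(source: str, timeout_s: float = 10.0) -> str:
    """Run one Python snippet in its own workspace and return its stdout.

    WARNING: lifecycle demo only. The code executes with local user
    permissions and is NOT sandboxed.
    """
    with ephemeral_workspace() as workdir:
        script = workdir / "task.py"
        script.write_text(source)
        result = subprocess.run(
            [sys.executable, str(script)],
            cwd=workdir,          # relative paths resolve inside the workspace
            capture_output=True,
            text=True,
            timeout=timeout_s,    # kill runaway tasks
            check=True,           # raise if the snippet exits non-zero
        )
        return result.stdout


print(run_task("open('marker.txt', 'w').write('x'); print('wrote')"))
print(run_task("import os; print(os.path.exists('marker.txt'))"))
```

The second task cannot see the file written by the first because each call gets a new directory that is removed on exit, which is the "no state leakage" property in miniature.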
The decision framework
Is the code AI-generated or untrusted?
├── No → Docker container is fine
└── Yes → Does it need GPU access?
    ├── Yes → Modal (gVisor + GPU)
    └── No → Does it need a persistent filesystem?
        ├── Yes → E2B (Firecracker, 24h sessions)
        └── No → Is it a stateless edge function?
            ├── Yes → Cloudflare Workers (V8 isolates)
            └── No → Does it need a full dev environment?
                ├── Yes → Daytona
                └── No → E2B (default choice)
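The decision tree above reads naturally as a chain of early returns, checked in the same order as the questions. The function and parameter names below are hypothetical, and the returned strings are just the platform labels used in this post.

```python
def choose_sandbox(
    untrusted: bool,
    needs_gpu: bool = False,
    needs_persistent_fs: bool = False,
    stateless_edge: bool = False,
    needs_dev_env: bool = False,
) -> str:
    """Walk the decision framework top to bottom and return a platform label."""
    if not untrusted:
        return "Docker container"
    if needs_gpu:
        return "Modal"
    if needs_persistent_fs:
        return "E2B"
    if stateless_edge:
        return "Cloudflare Workers"
    if needs_dev_env:
        return "Daytona"
    return "E2B"  # default choice for untrusted code


print(choose_sandbox(untrusted=True, needs_gpu=True))       # → Modal
print(choose_sandbox(untrusted=True, stateless_edge=True))  # → Cloudflare Workers
```

Order matters here just as it does in the tree: a workload that is both GPU-bound and stateless is routed to Modal because the GPU question is asked first.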
What this means for your agent architecture
If you are building AI agents that generate and execute code — whether in an IDE, a CI pipeline, a chatbot, or an autonomous workflow — Docker alone is not your security boundary. It is your deployment unit. The security boundary needs to be a sandbox with hardware-level isolation.
The good news: the sandbox platforms have made this a one-line SDK call instead of an infrastructure project. The bad news: most teams are still running agents with local user permissions in shared process environments, hoping the model does not hallucinate an rm -rf.
The IDEsaster disclosure showed what happens when that hope is misplaced. The sandbox race is the industry’s answer.
This post is based on research published in June 2026, drawing from technical analyses by Northflank, Docker, AgentMarketCap, Paperclipped, SoftwareSeni, and the Firecracker/gVisor/Kata Containers documentation. The full research report is available in my [[AI Agent Sandboxing - Containers vs VMs vs MicroVMs Deep Research|research notes]].
Saram Consulting