Hardening LLMs: Inside the Architecture of Secure AI Sandboxing

How many times this week have you seen a demo of an AI agent "autonomously" fixing a bug, writing a script, or interacting with a database? It’s the hottest trend in software engineering right now. We are moving rapidly from static RAG (Retrieval-Augmented Generation) pipelines to autonomous AI agents that can execute code, run shell commands, and browse the web on our behalf.

But as a developer, you should feel a cold shiver run down your spine when you think about this. If you give an LLM like Claude or GPT-4 the power to write and execute code, what happens when it gets hit with a prompt injection attack? What happens if it generates a runaway loop that exhausts your host's resources, or worse, decides to run rm -rf / inside your production cluster?

Recently, Anthropic shared some fascinating insights into how they contain Claude across their products. It got me thinking: how do we, as developers building AI-powered applications, build robust, secure sandboxes to run LLM-generated code? Today, we are going inside the architecture of secure AI sandboxing. Grab your coffee, and let’s dive in.

The Threat Model of the AI Agent

Before we write a single line of configuration, we need to understand our threat model. When you let an LLM execute code (for example, using a Python interpreter tool to calculate a complex math equation or plot a graph), you are essentially running untrusted, arbitrary user input as executable code. This is the classic remote code execution (RCE) vulnerability, except it's a feature, not a bug.

The risks fall into three main categories:

Resource Exhaustion (DoS): The LLM generates code that runs an infinite loop, allocates memory without bound, or spawns processes endlessly (a fork bomb), crashing the host system.
Data Exfiltration and Network Attacks: The LLM-generated code scans your internal network, queries a cloud metadata service (such as the AWS endpoint at 169.254.169.254), or uploads sensitive database credentials to an external server.
Host Compromise: The generated code escapes its execution environment, gains root access to the host machine, and compromises your entire cloud infrastructure.

To mitigate these risks, we cannot rely on "prompt engineering" or on a system prompt that says "please do not run malicious commands." The LLM's output is untrusted. We must contain it at the infrastructure level.

The Layers of Defense-in-Depth

Standard Docker containers are not a secure sandbox boundary. "Containers are not security boundaries" is a mantra we must repeat daily. Since containers share the host kernel, a kernel exploit can allow an attacker (or a compromised LLM) to break out of the container to the host.

To safely run LLM code, we need a multi-layered defense-in-depth architecture. Here is how a production-grade AI sandbox looks under the hood:


+-------------------------------------------------------------+
|                     Host OS / Hypervisor                    |
|                                                             |
|  +-------------------------------------------------------+  |
|  |             Sandbox Layer (Firecracker or gVisor)     |  |
|  |                                                       |  |
|  |  +-------------------------------------------------+  |  |
|  |  |           Isolated Docker Container             |  |  |
|  |  |                                                 |  |  |
|  |  |  +-------------------------------------------+  |  |  |
|  |  |  |             Restricted Runtime            |  |  |  |
|  |  |  |   - No Network (or strict firewall)       |  |  |  |
|  |  |  |   - Read-only Root Filesystem             |  |  |  |
|  |  |  |   - Seccomp & AppArmor Profiles           |  |  |  |
|  |  |  +-------------------------------------------+  |  |  |
|  |  +-------------------------------------------------+  |  |
|  +-------------------------------------------------------+  |
+-------------------------------------------------------------+

Layer 1: MicroVMs (The Hard Boundary)

Instead of bare Docker, we should run LLM code inside lightweight microVMs. A microVM monitor such as AWS Firecracker combines near-container startup speed with the hardware-virtualization isolation of a traditional virtual machine: each sandbox gets its own guest kernel, so a kernel exploit inside the sandbox does not land directly on the host kernel. Firecracker, written in Rust, uses the Linux Kernel-based Virtual Machine (KVM) to create and manage microVMs that boot in a fraction of a second, making it well suited to dynamic, on-demand code execution tasks.
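
To make the microVM lifecycle a little more concrete, here is a minimal sketch of the sequence of REST calls Firecracker expects on its API socket before a guest boots. The helper name, the kernel and rootfs paths, and the sizing defaults are all hypothetical; the sketch only builds the request payloads and leaves sending them over the Unix socket to whatever HTTP client you prefer.

```python
import json


def firecracker_boot_requests(kernel_path, rootfs_path, vcpus=1, mem_mib=128):
    """Return the ordered (method, path, body) calls that configure and start one microVM.

    Hypothetical helper: it only assembles payloads, it does not talk to Firecracker.
    """
    return [
        # Size the guest: keep it small, the sandbox only runs short scripts.
        ("PUT", "/machine-config", {"vcpu_count": vcpus, "mem_size_mib": mem_mib}),
        # Point Firecracker at an uncompressed guest kernel image.
        ("PUT", "/boot-source", {
            "kernel_image_path": kernel_path,
            "boot_args": "console=ttyS0 reboot=k panic=1 pci=off",
        }),
        # Attach the root filesystem read-only so the guest cannot persist changes.
        ("PUT", "/drives/rootfs", {
            "drive_id": "rootfs",
            "path_on_host": rootfs_path,
            "is_root_device": True,
            "is_read_only": True,
        }),
        # Boot the guest once everything above has been accepted.
        ("PUT", "/actions", {"action_type": "InstanceStart"}),
    ]


if __name__ == "__main__":
    # Paths below are placeholders, not real files.
    for method, path, body in firecracker_boot_requests("/srv/vm/vmlinux", "/srv/vm/rootfs.ext4"):
        print(method, path, json.dumps(body))
```

In practice you would create one microVM per execution request (or keep a small warm pool) and destroy it afterwards, so no state survives between runs.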

Layer 1 Alternative: gVisor (Interception-based Isolation)

If Firecracker feels like too much infrastructure overhead, gVisor is a fantastic alternative. gVisor, an open-source project from Google, is not a microVM: it is a container runtime that implements a user-space kernel (called the "Sentry") written in Go. It intercepts the system calls made by the application and handles them in user space, so the application never talks to the host kernel directly, and the Sentry itself is restricted to a small, filtered set of host system calls.

To run a Docker container using gVisor (runsc), first register runsc as a runtime with your Docker daemon, then select it per container, for example in your compose file:

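# docker-compose.yml
# Note: the legacy "version: '3.x'" key is omitted. The v3 file format did not
# accept "runtime", while the current Compose Specification does.
services:
  ai-executor:
    image: python:3.11-slim
    runtime: runsc # This tells Docker to use gVisor (runsc must be registered with the daemon)
    volumes:
      - ./sandbox_scripts:/workspace
    working_dir: /workspace
    command: python exec_code.py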
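
Before relying on that runtime setting, it is worth confirming that the daemon actually knows about runsc, because a typo or a missing registration is easy to overlook. The sketch below is one hedged way to do that from Python: it parses the output of docker info --format '{{json .Runtimes}}'. The function names are hypothetical, and the check only proves the runtime is registered, not that it is configured securely.

```python
import json
import subprocess


def runtime_registered(runtimes_json: str, name: str = "runsc") -> bool:
    """Return True if `name` appears in the JSON map printed by `docker info`."""
    runtimes = json.loads(runtimes_json)
    return isinstance(runtimes, dict) and name in runtimes


def docker_runtimes_json() -> str:
    """Ask the local Docker daemon for its registered runtimes (requires the docker CLI)."""
    result = subprocess.run(
        ["docker", "info", "--format", "{{json .Runtimes}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Example with a canned response, so the check can be read without Docker installed.
sample = '{"runc": {"path": "runc"}, "runsc": {"path": "/usr/local/bin/runsc"}}'
print(runtime_registered(sample))
```

A deployment script could call runtime_registered(docker_runtimes_json()) and refuse to start the executor service if it returns False, rather than silently falling back to the default runtime.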

Hardening the Container Internals

Even inside a MicroVM or gVisor, we must apply the principle of least privilege within the execution container itself. Let's look at three critical configurations you need to implement.

1. Read-Only Root Filesystem and Ephemeral Storage

The code execution environment should be stateless and read-only. This prevents malicious scripts from modifying system binaries, installing backdoor packages, or persisting files. If the code needs to write data (e.g., generating an image or a CSV report), mount an ephemeral, size-limited tmpfs volume to a specific directory (like /tmp).

docker run --read-only \
           --tmpfs /tmp:rw,size=50m,noexec,nosuid,nodev \
           my-ai-executor-image

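In this example, /tmp is writable but limited to 50MB, and mounted with noexec, nosuid, and nodev. Note what noexec does and does not cover: it blocks direct execution of binaries written to /tmp, but an interpreter can still read and run a script stored there (python3 /tmp/script.py works), so it complements the other controls rather than replacing them.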

2. Restricting Network Access

An LLM running Python code should not be able to talk to your internal network or make outbound requests to the internet unless absolutely necessary. If your agent is just doing data processing or math, disable the network stack entirely:

docker run --network none my-ai-executor-image

If the agent must access specific APIs, use local firewall rules (iptables/nftables) or an egress proxy that allows only an explicit list of domains and denies everything else. Never let an untrusted container communicate with the AWS metadata service (169.254.169.254) or your internal database VPC.
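
As an illustration of the allowlist idea, here is a minimal sketch of the decision an egress proxy might make for each outbound request. The host name api.example.com and the function name are hypothetical. The sketch rejects anything that is not HTTPS, rejects every IP literal (which covers 169.254.169.254), and otherwise requires an exact host match.

```python
import ipaddress
from urllib.parse import urlsplit

# Hypothetical allowlist; in a real deployment this would come from configuration.
ALLOWED_HOSTS = {"api.example.com"}


def is_egress_allowed(url: str, allowed_hosts=ALLOWED_HOSTS) -> bool:
    """Allow only HTTPS requests to explicitly listed host names."""
    parts = urlsplit(url)
    if parts.scheme != "https" or not parts.hostname:
        return False
    host = parts.hostname.lower()
    try:
        ipaddress.ip_address(host)
    except ValueError:
        pass  # Not an IP literal, fall through to the name check.
    else:
        return False  # Raw IPs (metadata service, internal ranges) are never allowed.
    return host in allowed_hosts


print(is_egress_allowed("https://api.example.com/v1/data"))
print(is_egress_allowed("http://169.254.169.254/latest/meta-data/"))
```

A name check alone is not sufficient: an allowed name could resolve to an internal address, so a real proxy should also validate the resolved IP, and the container should have no network route that bypasses the proxy.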

3. Drop Linux Capabilities and Restrict System Calls (Seccomp)

By default, Docker containers run as root with a default subset of Linux capabilities (for example CAP_CHOWN and CAP_NET_BIND_SERVICE). For an AI sandbox, you should drop *all* capabilities and selectively add back only what is necessary (usually nothing). Additionally, use a Seccomp profile to restrict the system calls the container can make to the kernel.

docker run --cap-drop=ALL \
           --security-opt=no-new-privileges:true \
           --security-opt seccomp=custom-profile.json \
           my-ai-executor-image

Setting no-new-privileges:true is vital: it prevents processes inside the container from gaining new privileges when they execute setuid or setgid binaries, or binaries with file capabilities.
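
The command above refers to custom-profile.json without showing it. As a hedged sketch of what such a file could contain, the snippet below generates a deny-by-default profile in Docker's seccomp JSON format: every system call fails with an error unless it is explicitly listed. The syscall list is illustrative only and is almost certainly too short to run a real CPython interpreter; in practice you would start from Docker's default profile or from a trace of your workload and trim from there. The function names are hypothetical.

```python
import json

# Illustrative only: far from the full set a Python interpreter needs.
ILLUSTRATIVE_SYSCALLS = [
    "read", "write", "close", "fstat", "mmap", "munmap",
    "brk", "rt_sigreturn", "exit_group",
]


def build_seccomp_profile(allowed_syscalls):
    """Build a deny-by-default seccomp profile that allows only the given syscalls."""
    return {
        "defaultAction": "SCMP_ACT_ERRNO",       # Anything not listed fails with an error.
        "architectures": ["SCMP_ARCH_X86_64"],   # Assumes an x86-64 host.
        "syscalls": [
            {"names": sorted(set(allowed_syscalls)), "action": "SCMP_ACT_ALLOW"},
        ],
    }


def write_profile(path, allowed_syscalls=ILLUSTRATIVE_SYSCALLS):
    """Write the profile to `path` so it can be passed via --security-opt seccomp=<path>."""
    with open(path, "w") as f:
        json.dump(build_seccomp_profile(allowed_syscalls), f, indent=2)


print(json.dumps(build_seccomp_profile(ILLUSTRATIVE_SYSCALLS), indent=2))
```

Generating the profile from a list kept in version control makes changes to the allowed set reviewable, which matters because every added syscall widens the kernel attack surface.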

Implementing a Python Code Execution Sandbox

Let's look at a practical Python example of how you can structure a secure execution harness. Assume your web application accepts a block of Python code generated by an LLM, and you need to run it, capture the stdout/stderr, and return it safely to the user.

We'll use Python's subprocess module to run the code inside our hardened container, enforcing strict execution timeouts and resource limits.

import os
import resource
import subprocess
import sys
import tempfile

def set_resource_limits():
    """Restricts the resources allocated to the child process.

    Runs in the child after fork() and before exec(). Note that preexec_fn
    is not safe to use when the parent process is multi-threaded.
    """
    # Limit CPU time to 5 seconds
    resource.setrlimit(resource.RLIMIT_CPU, (5, 5))

    # Limit virtual memory (address space) to 100MB.
    # Allocations beyond this fail inside the child instead of pressuring the host.
    mem_limit = 100 * 1024 * 1024  # 100 MB
    resource.setrlimit(resource.RLIMIT_AS, (mem_limit, mem_limit))

    # Limit max processes to blunt fork bombs.
    # RLIMIT_NPROC counts every process owned by the same UID, not just this child's descendants.
    resource.setrlimit(resource.RLIMIT_NPROC, (10, 10))

def execute_untrusted_code(code_string: str) -> dict:
    # Save the generated code to a unique file on the tmpfs mount.
    # A fixed path would let concurrent requests overwrite each other's code.
    fd, temp_file_path = tempfile.mkstemp(suffix=".py", dir="/tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(code_string)
        # mkstemp creates the file with mode 0600; make it readable by uid 1000
        os.chmod(temp_file_path, 0o644)

        # Run the code as a non-root user (uid 1000) inside our sandbox.
        # user= needs Python 3.9+ and permission to switch UID (CAP_SETUID), or a
        # harness that already runs as uid 1000 (e.g. under --cap-drop=ALL).
        # preexec_fn sets the resource limits right before execution starts.
        result = subprocess.run(
            [sys.executable, "-I", temp_file_path],  # -I: isolated mode (ignores PYTHON* vars, user site)
            capture_output=True,
            text=True,
            timeout=5,   # Hard wall-clock timeout
            user=1000,   # Run as non-privileged user
            env={},      # Do not leak the harness's environment (secrets, tokens) to the child
            preexec_fn=set_resource_limits,
        )

        return {
            "success": result.returncode == 0,
            "stdout": result.stdout,
            "stderr": result.stderr
        }

    except subprocess.TimeoutExpired:
        return {
            "success": False,
            "stdout": "",
            "stderr": "Execution timed out. Maximum limit is 5 seconds."
        }
    finally:
        # Clean up the generated file immediately
        if os.path.exists(temp_file_path):
            os.remove(temp_file_path)

Why this approach is resilient:

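Isolated Mode (python3 -I): Makes the interpreter ignore PYTHON* environment variables and keeps the script's directory and the user site-packages directory off sys.path, which blocks malicious import shadowing. It does not hide the process environment from the script itself, so the harness should also hand the child a minimal environment rather than its own.
Strict Limits (resource.setrlimit): Restricts CPU, memory, and process spawning directly at the system level before the interpreter executes the untrusted script. Keep in mind that RLIMIT_NPROC counts all processes owned by the same UID, so size it for everything that runs as that user.
Non-root User: Executing code as UID 1000 ensures that even if a system call is allowed, the script lacks the permissions to touch system configurations. Switching UID from the harness requires the privilege to do so (CAP_SETUID); in a container started with --cap-drop=ALL, run the whole container as the unprivileged user instead.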
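
The harness above runs inside the container, so something on the host still has to start that container with the hardening flags from the previous sections. A minimal sketch of assembling that command is shown below. The function name is hypothetical, and the --user, --memory, --pids-limit, and --cpus values are assumptions added here for illustration (they are real docker run flags, but the specific limits are not from the text above); tune them for your workload.

```python
def build_docker_command(image, runtime="runsc", seccomp_profile="custom-profile.json"):
    """Assemble a `docker run` argv list that combines the hardening flags discussed earlier."""
    return [
        "docker", "run", "--rm",
        "--runtime", runtime,                               # gVisor, if registered with the daemon
        "--user", "1000:1000",                              # assumption: run unprivileged from the start
        "--network", "none",                                # no network stack at all
        "--read-only",                                      # immutable root filesystem
        "--tmpfs", "/tmp:rw,size=50m,noexec,nosuid,nodev",  # small scratch space
        "--cap-drop=ALL",
        "--security-opt", "no-new-privileges:true",
        "--security-opt", f"seccomp={seccomp_profile}",
        "--memory", "256m",                                 # assumption: cgroup memory cap
        "--pids-limit", "64",                               # assumption: cgroup process cap
        "--cpus", "0.5",                                    # assumption: CPU share cap
        image,
    ]


cmd = build_docker_command("my-ai-executor-image")
print(" ".join(cmd))
```

Passing the command as a list to subprocess.run (never as a shell string) keeps any LLM-influenced value, such as an image tag, from being interpreted by a shell. The cgroup limits also back up the setrlimit calls, since they apply to the container as a whole rather than to a single process.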

Monitoring and Anomaly Detection

No sandbox is 100% impenetrable. A crucial aspect of containment is observability. You must monitor your sandboxed environments for anomalous behavior. If an AI container suddenly starts trying to access local files like /etc/passwd, making high volumes of DNS queries, or trying to write to read-only directories, you need to know instantly.

Consider implementing tools like Falco (a CNCF cloud-native runtime security tool). Falco uses eBPF to monitor system calls in real-time. You can write a custom Falco rule that alerts you if an AI execution container attempts to spawn an unexpected shell or make an unauthorized system call:

- rule: Unauthorized Shell Spawn in AI Sandbox
  desc: Detects when a shell is spawned inside an AI execution container
  condition: container.image.repository = "my-ai-executor-image" and spawned_process and proc.name in (bash, sh, zsh)
  output: "Shell spawned inside AI Sandbox (user=%user.name %container.info parent=%proc.pname cmdline=%proc.cmdline)"
  priority: WARNING
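
An alert is only useful if something acts on it. The sketch below shows one hedged way to consume Falco alerts from Python, assuming Falco is configured to emit JSON lines and that the alert's output_fields include container.id (which depends on the fields referenced in the rule's output). The function name and the priority set are hypothetical choices for illustration.

```python
import json

# Priorities that should trigger termination of the sandbox (an assumption, tune as needed).
KILL_PRIORITIES = {"Emergency", "Alert", "Critical", "Error", "Warning"}


def container_to_kill(alert_line, rule_name="Unauthorized Shell Spawn in AI Sandbox"):
    """Return the container ID to terminate for a matching alert, or None."""
    try:
        alert = json.loads(alert_line)
    except json.JSONDecodeError:
        return None  # Ignore non-JSON log lines.
    if not isinstance(alert, dict):
        return None
    if alert.get("rule") != rule_name:
        return None
    if alert.get("priority") not in KILL_PRIORITIES:
        return None
    return (alert.get("output_fields") or {}).get("container.id")


# Canned alert for illustration; real alerts carry more fields.
example = json.dumps({
    "rule": "Unauthorized Shell Spawn in AI Sandbox",
    "priority": "Warning",
    "output_fields": {"container.id": "abc123", "proc.name": "sh"},
})
print(container_to_kill(example))
```

A small daemon could read Falco's output line by line, call this function, and stop any container it returns, turning detection into automatic containment.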

Wrapping Up: Build Safely First

As the capabilities of Claude, GPT, and open-source models grow, our responsibility to secure the execution layers of our software grows too. Building an AI-assisted application is incredibly exciting, but skipping runtime containment is a recipe for disaster. By leveraging microVMs/gVisor, applying strict system limits, dropping privileges, and monitoring with eBPF, you can build a robust, production-ready environment that keeps your infrastructure secure while giving your AI agents room to innovate.

How are you handling secure code execution in your AI projects? Are you using gVisor, or are you running everything in Kubernetes with strict network policies? Let me know in the comments below, or reach out on Twitter/X!

Until next time, keep coding, keep building, and stay secure.