When LLMs Go Wild on the Network: The $4,000 API Bill from a Recursive Port Scanner

We’ve all been there: you write a script, put it in an infinite loop by accident, and run up a slightly embarrassing cloud bill. Maybe you left a GPU instance running over the weekend, or perhaps your recursive S3 bucket scraper got a little too enthusiastic. It’s a rite of passage for modern developers.

But recently, a story hit the top of Hacker News that takes the "accidental cloud spend" trope to a terrifying new level. An autonomous AI agent, tasked with exploring and scanning DN42 (a decentralized, private VPN-based network that mimics the internet), managed to completely bankrupt its operator's API account in a matter of hours. The culprit? An unconstrained LLM loop, a recursively generated subnet list, and zero rate-limiting or budget guardrails.

As developers, we are being pushed to integrate LLMs into everything—from CI/CD pipelines to automated security scanners. But this incident is a massive wake-up call. Today, we’re going to dissect exactly how an autonomous AI agent managed to burn through a bank account while trying to map a darknet, look at the underlying software engineering failures, and write some robust code to ensure this never happens to your infrastructure.

What is DN42, and Why Was the AI Scanning It?

Before we look at the financial trainwreck, let’s understand the playground. DN42 (Decentralized Network 42) is a large, collaborative peer-to-peer network. It uses the same building blocks as the real internet (BGP for routing, anycast, and its own DNS root servers), but participants connect to each other through encrypted tunnels such as WireGuard that run over the public internet. It’s where network engineers go to practice routing without breaking the actual web.

The operator of this particular AI agent wanted to use a Large Language Model to act as an autonomous network explorer. The goal was simple: write an agent using a framework like LangChain or AutoGPT, equip it with network tools (like ping, nmap, and whois), and let it explore the DN42 address space to find active hosts, open services, and document the topology of this alternative internet.

It sounds like an incredibly cool weekend project. What could go wrong?

The Anatomy of the Death Loop

The agent was built using a classic ReAct (Reasoning and Acting) loop. In this architecture, the LLM is given a prompt, observes the current state, decides on an action (like running a CLI tool), reads the tool's output, and decides what to do next.

Here is a simplified text-based architecture of how this agent was wired up:


+--------------------------------------------------------------+
|                          LLM Engine                          |
|  "I need to scan the next block. Let me run nmap."          |
+------------------------------+-------------------------------+
                               |
                   Prompt with | Tool Output
                   Tool Result |
                               v
+------------------------------+-------------------------------+
|                        Agent Executor                        |
|   Runs system commands: `nmap -sV -p- 172.22.0.0/16`         |
+------------------------------+-------------------------------+
                               |
                               | Executes Shell
                               v
+------------------------------+-------------------------------+
|                       Target Network (DN42)                  |
+--------------------------------------------------------------+
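The loop in the diagram can be sketched in a few lines of Python. This is a minimal illustration, not the operator's actual code: `fake_llm` and `fake_tool` are hypothetical stand-ins for a real model call and a real shell command, so the example runs offline.

```python
from typing import Callable

def fake_llm(history: list[str]) -> str:
    """Hypothetical stand-in for a model call: asks for one scan, then stops."""
    if any(line.startswith("OBSERVATION:") for line in history):
        return "FINISH"
    return "ACTION: nmap -sn 172.22.5.0/24"

def fake_tool(command: str) -> str:
    """Hypothetical stand-in for running a shell command."""
    return f"(pretend output of: {command})"

def react_loop(llm: Callable[[list[str]], str],
               tool: Callable[[str], str],
               task: str,
               max_steps: int = 5) -> list[str]:
    history = [f"TASK: {task}"]
    for _ in range(max_steps):          # bounded, unlike `while True`
        decision = llm(history)         # reason: the model sees the full history
        history.append(decision)
        if not decision.startswith("ACTION: "):
            break                       # the model chose to stop
        output = tool(decision.removeprefix("ACTION: "))  # act
        history.append(f"OBSERVATION: {output}")          # observe
    return history

if __name__ == "__main__":
    for line in react_loop(fake_llm, fake_tool, "map one DN42 subnet"):
        print(line)
```

Note that `history` only ever grows, and the whole list is handed to the model on every step. That detail is what makes the failure modes below so expensive.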

The disaster happened because of a combination of three classic software design flaws:

1. The Recursive Explosion of Context

When you run nmap on a large subnet, it returns a massive wall of text. The agent's executor script took this raw stdout and dumped it directly back into the LLM's context window. Because DN42 contains hundreds of dead routes and slow-responding hosts, the output was incredibly verbose, messy, and filled with connection timeouts.

Faced with thousands of lines of chaotic network data, the LLM got confused. Instead of parsing the data cleanly, it "decided" that it needed to run more specific scans on every single individual IP address it had just discovered to "verify" their status. This triggered a recursive loop: 1 subnet scan turned into 256 host scans, which turned into individual port scans, all while feeding the ever-growing history back into the LLM.
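One way to stop this kind of fan-out is to check every requested target against a fixed scope before any tool runs. The sketch below is an assumption-laden example: `ScanScope` is a hypothetical helper (not part of nmap or any agent framework), and the caps of 256 addresses per target and 20 targets per run are arbitrary.

```python
import ipaddress

class ScanScope:
    """Hypothetical guard: caps target size, target count, and repeats."""

    def __init__(self, max_addresses: int = 256, max_targets: int = 20):
        self.max_addresses = max_addresses
        self.max_targets = max_targets
        self.seen: set[str] = set()

    def approve(self, target: str) -> bool:
        # strict=False accepts host bits, e.g. "172.22.5.1/24"
        network = ipaddress.ip_network(target, strict=False)
        key = str(network)
        if network.num_addresses > self.max_addresses:
            return False            # too large, e.g. a whole /16
        if key in self.seen:
            return False            # already scanned; do not "verify" it again
        if len(self.seen) >= self.max_targets:
            return False            # fan-out budget for this run is spent
        self.seen.add(key)
        return True

if __name__ == "__main__":
    scope = ScanScope()
    print(scope.approve("172.22.0.0/16"))   # False: 65,536 addresses
    print(scope.approve("172.22.5.0/24"))   # True
    print(scope.approve("172.22.5.0/24"))   # False: repeat
```

`ipaddress.ip_network` raises `ValueError` on malformed input, so a real executor would also want to catch that and reject the request.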

2. The "Context Window" Inflation

As the history grew, so did the cost of every call. The price per token stays the same, but the entire history is billed again as input on each request. Input tokens are usually cheaper than output tokens, but they aren't free. If you are sending a 100,000-token conversation history back to GPT-4o or Claude 3.5 Sonnet on every single iteration of a loop, each API call can cost $0.30 to $1.00. Multiply that by thousands of automated loop iterations per hour, and the math gets scary very fast.
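To see how quickly this compounds, here is a rough back-of-the-envelope calculation. All numbers are illustrative assumptions rather than figures from the incident: $5.00 per million input tokens, a history that starts at 10,000 tokens, 2,000 tokens of tool output appended per step, and a 128,000-token context limit.

```python
def cumulative_input_cost(steps: int,
                          start_tokens: int = 10_000,
                          growth_per_step: int = 2_000,
                          max_context: int = 128_000,
                          usd_per_million: float = 5.00) -> float:
    """Total input cost when the full history is resent on every step."""
    total_tokens = 0
    context = start_tokens
    for _ in range(steps):
        total_tokens += context      # the whole history is billed again
        # new tool output is appended until the context window is full
        context = min(context + growth_per_step, max_context)
    return total_tokens / 1_000_000 * usd_per_million

if __name__ == "__main__":
    for steps in (10, 100, 1_000):
        print(f"{steps:>5} steps: ${cumulative_input_cost(steps):,.2f}")
```

Under these assumptions the total grows roughly with the square of the step count until the context window fills up, and linearly at the maximum per-call price after that.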

3. No Circuit Breakers

The script was running in a simple while True: loop on a VPS. The operator went to sleep, expecting the agent to make a few dozen queries. There were no hard spending limits set on the OpenAI/Anthropic developer platform, and the Python script had no iteration cap, time limit, or budget check that could stop the loop. The agent ran continuously for hours, burning money at a rate of hundreds of dollars an hour, until the operator's credit card was declined and the API account was locked.

How to Prevent AI Agent Runaway (With Code)

If you are building AI agents that interface with the real world—whether they are scanning networks, writing code, or managing databases—you must implement guardrails. Relying on the LLM to "know when to stop" is a recipe for financial ruin.

Let's write a budget-aware guard for an agent loop in Python. It acts as a "circuit breaker" for your LLM calls, halting the run as soon as the estimated spend or the iteration count crosses a hard limit.

The Budget and Token Guardrail Implementation

Here is how we can implement a robust wrapper that tracks cumulative token usage, calculates real-time costs, and forcefully terminates the process if thresholds are breached.


import sys
from dataclasses import dataclass

@dataclass
class TokenCostConfig:
    input_cost_per_million: float
    output_cost_per_million: float

# Example rates for illustration (GPT-4o-style pricing); check your provider's current price list
PRICING_MODEL = TokenCostConfig(
    input_cost_per_million=5.00,  # $5.00 per 1M tokens
    output_cost_per_million=15.00 # $15.00 per 1M tokens
)

class BudgetGuardrail:
    def __init__(self, max_budget_usd: float, pricing: TokenCostConfig):
        self.max_budget = max_budget_usd
        self.pricing = pricing
        self.accumulated_cost = 0.0
        self.iteration_count = 0
        self.max_iterations = 50 # Hard limit on loop iterations

    def track_usage(self, input_tokens: int, output_tokens: int):
        self.iteration_count += 1
        
        # Calculate costs
        input_cost = (input_tokens / 1_000_000) * self.pricing.input_cost_per_million
        output_cost = (output_tokens / 1_000_000) * self.pricing.output_cost_per_million
        step_cost = input_cost + output_cost
        
        self.accumulated_cost += step_cost
        
        print(f"[Guardrail] Iteration {self.iteration_count}: Spent ${step_cost:.4f} (Total: ${self.accumulated_cost:.4f})")
        
        # Check thresholds
        if self.accumulated_cost >= self.max_budget:
            raise PermissionError(
                f"BUDGET EXCEEDED! Current spend ${self.accumulated_cost:.4f} "
                f"exceeds allowed limit of ${self.max_budget:.2f}."
            )
            
        if self.iteration_count >= self.max_iterations:
            raise OverflowError(
                f"ITERATION LIMIT EXCEEDED! Reached max iterations ({self.max_iterations})."
            )

# Simulation of our guardrail in an execution loop
if __name__ == "__main__":
    # We set a strict $2.00 limit for safety
    guardrail = BudgetGuardrail(max_budget_usd=2.00, pricing=PRICING_MODEL)
    
    try:
        # Simulated run where the LLM gets stuck in a loop
        # and sends back larger and larger contexts
        for step in range(1, 100):
            # Simulated expansion of context window (e.g., massive nmap dumps)
            simulated_input_tokens = 50_000 + (step * 25_000) 
            simulated_output_tokens = 1_000
            
            # Check in with the guardrail first, using estimated token counts,
            # so a call that would breach a limit is never made
            guardrail.track_usage(simulated_input_tokens, simulated_output_tokens)
            
            # (In reality, you would make your API call here)
            print(f"Executing step {step}... AI is thinking.")
            
    except (PermissionError, OverflowError) as e:
        print(f"\n[CRITICAL SYSTEM HALT] {e}", file=sys.stderr)
        # Here you would trigger alerts, save state, and shut down safely
        sys.exit(1)
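The budget guard above caps total spend but not speed. A complementary sketch is a sliding-window limiter on the number of model calls per time window. `CallRateLimiter` is a hypothetical helper written for this post, not a library class; the clock is injectable so the logic can be checked without actually sleeping.

```python
import time
from collections import deque
from typing import Callable

class CallRateLimiter:
    """Hypothetical sliding-window limiter: at most max_calls per window."""

    def __init__(self, max_calls: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.max_calls = max_calls
        self.window = window_seconds
        self.clock = clock
        self.calls: deque[float] = deque()   # timestamps of recent calls

    def wait_time(self) -> float:
        """Seconds to wait before the next call is allowed (0.0 if none)."""
        now = self.clock()
        while self.calls and now - self.calls[0] >= self.window:
            self.calls.popleft()             # forget calls that left the window
        if len(self.calls) < self.max_calls:
            return 0.0
        return self.window - (now - self.calls[0])

    def acquire(self) -> None:
        """Block until a call is allowed, then record it."""
        delay = self.wait_time()
        if delay > 0:
            time.sleep(delay)
        self.calls.append(self.clock())
```

In an agent loop you would call `limiter.acquire()` right before each model request, for example with `CallRateLimiter(max_calls=10, window_seconds=60)`. Even if every other guard fails, this puts a ceiling on how fast money can leave the account.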

Three Golden Rules for AI Agent Architecture

Beyond simple budget tracking in your code, there are architectural patterns you should follow whenever you give an LLM access to tools, network APIs, or file systems.

1. Never Pass Raw Tool Output Directly to the LLM

If you run a system tool like nmap, find, or a database query, do not dump the raw output back into the LLM context. Write a parser middleware. For example, if nmap scans a /24 subnet and finds two live hosts, your middleware should parse the XML/stdout and return a structured, highly compressed summary to the LLM:

"Scan complete. Found 2 active hosts out of 256. Host A: 172.22.5.1 (ports 80, 22 open). Host B: 172.22.5.12 (no open ports)."

This keeps the tokens added per step small and predictable, prevents context window bloat, and stops the LLM from getting overwhelmed by noise.
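Here is a minimal sketch of such a middleware, assuming the scan was run with nmap's XML output option (`-oX`). The sample document is a hand-written, heavily trimmed imitation of that format rather than real scan output, and `summarize_nmap_xml` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written, trimmed sample in the shape of `nmap -oX` output (assumption).
SAMPLE_XML = """<nmaprun>
  <host><status state="up"/><address addr="172.22.5.1" addrtype="ipv4"/>
    <ports>
      <port protocol="tcp" portid="22"><state state="open"/></port>
      <port protocol="tcp" portid="80"><state state="open"/></port>
      <port protocol="tcp" portid="443"><state state="closed"/></port>
    </ports>
  </host>
  <host><status state="up"/><address addr="172.22.5.12" addrtype="ipv4"/></host>
  <host><status state="down"/><address addr="172.22.5.13" addrtype="ipv4"/></host>
</nmaprun>"""

def summarize_nmap_xml(xml_text: str, max_hosts: int = 20) -> str:
    """Reduce scan XML to a short summary of live hosts and open ports."""
    root = ET.fromstring(xml_text)
    lines = []
    for host in root.iter("host"):
        status = host.find("status")
        if status is None or status.get("state") != "up":
            continue                                  # skip dead hosts
        address = host.find("address")
        addr = address.get("addr") if address is not None else "unknown"
        open_ports = [
            port.get("portid")
            for port in host.iter("port")
            if port.find("state") is not None
            and port.find("state").get("state") == "open"
        ]
        ports = ", ".join(open_ports) if open_ports else "none"
        lines.append(f"{addr}: open ports {ports}")
    shown = lines[:max_hosts]                         # hard cap on summary size
    header = f"Scan complete. {len(lines)} live host(s)."
    return "\n".join([header, *shown])

if __name__ == "__main__":
    print(summarize_nmap_xml(SAMPLE_XML))
```

The `max_hosts` cap matters as much as the parsing: whatever the scan returns, the text handed back to the model has a bounded size.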

2. Set Platform-Level Spending Hard Limits

Most LLM API providers (OpenAI, Anthropic, OpenRouter, Cohere) offer some form of spending control, such as monthly usage limits, prepaid credit caps, or per-key budgets. If you are building autonomous agents, create a dedicated API key with a strict monthly budget of $10 or $20. Do not use your primary production enterprise key with unlimited billing enabled for experimental scripts.

3. Implement "Human-in-the-Loop" for High-Cost or Destructive Actions

Autonomous agents are great, but some tasks should require approval. If your agent decides to initiate a task categorized as "expensive" (e.g., recursive scanning, bulk data processing) or "destructive" (deleting files, modifying firewall rules), the agent should yield control and wait for a human operator to type y/n in the terminal.
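A minimal sketch of such an approval gate follows. The marker list and the names `needs_approval` and `gated_run` are hypothetical, and plain substring matching is deliberately crude; a production system would more likely use an allowlist of permitted commands.

```python
from typing import Callable

# Hypothetical markers for commands that need a human decision (assumption).
RISKY_MARKERS = ("-p-", "/16", "rm ", "iptables", "nft ")

def needs_approval(command: str) -> bool:
    """Flag commands that look expensive or destructive."""
    return any(marker in command for marker in RISKY_MARKERS)

def gated_run(command: str,
              run: Callable[[str], str],
              ask: Callable[[str], str] = input) -> str:
    """Run the command, pausing for a y/n answer when it is flagged."""
    if needs_approval(command):
        answer = ask(f"Agent wants to run: {command!r}. Allow? [y/n] ")
        if answer.strip().lower() != "y":
            # Returned as tool output so the model can see the refusal
            return "DENIED: operator rejected this action."
    return run(command)
```

Passing `ask` as a parameter keeps the gate testable and lets you swap the terminal prompt for a chat message or a ticket later. Returning the denial as ordinary tool output gives the model a chance to pick a cheaper plan instead of retrying blindly.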

Conclusion

The story of the AI agent that bankrupted its operator scanning DN42 is a funny but valuable lesson. As software engineers, our job isn't just to write code that works; it's to write code that fails gracefully. When your code is powered by an AI engine that bills per token, failing gracefully becomes a financial necessity.

The next time you build an agent, treat the LLM like an untrusted, highly enthusiastic junior developer. Give them strict boundaries, keep an eye on their budget, and never leave them unsupervised with your credit card.

Have you ever had an API billing scare or a loop that got out of hand? What guardrails do you use in your own LLM applications? Let’s talk about it in the comments below!

If you found this breakdown useful, subscribe to the "Coding with Alex" newsletter at sysseder.com for weekly deep-dives into DevOps, security, and clean engineering.