Can AI Actually Hack Your App? What I Learned from a $1,500 LLM Pentesting Experiment

We’ve all seen the breathless marketing pitches: "AI is going to automate penetration testing!" or "Our LLM-powered security agent can find and patch every vulnerability in your codebase!" As developers, our collective hype-detectors usually go off when we hear claims like this. But behind the marketing fluff, there is a very real, very pressing question we need to answer: Can modern Large Language Models actually exploit our applications? And if so, how worried should we be?

Recently, a fascinating security experiment made waves in the dev community. A researcher built a deliberately vulnerable web application, set up an environment populated with realistic user data, API endpoints, and classic security flaws, and then spent $1,500 on LLM API calls (primarily OpenAI's GPT-4o and Anthropic's Claude 3.5 Sonnet) to see if these models could systematically find and exploit the vulnerabilities.

Today, we’re going to dissect the results of this experiment. We'll look at what LLMs are genuinely good at finding, where they hilariously fail, how "agentic workflows" change the security landscape, and what this means for how we write and secure our code today. Grab a coffee, and let’s dive in.

The Setup: The "Capture the Flag" Lab

To understand the capabilities of these AI models, we have to look at the playground they were put in. The researcher built a realistic multi-tenant web application featuring several classic vulnerabilities. These weren't just simple "input <script>alert(1)</script>" challenges. They resembled real-world, multi-step application logic flaws, including:

  • Insecure Direct Object References (IDOR): Accessing other users' private data by guessing sequential or predictable IDs in API endpoints.
  • SQL Injection (SQLi): Bypassing authentication or exfiltrating database records via unsanitized input fields.
  • Server-Side Request Forgery (SSRF): Tricking the server into making requests to internal-only metadata endpoints.
  • Business Logic Flaws: Manipulating multi-step workflows (like checkout processes or password resets) to bypass intended restrictions.

To make the experiment realistic, the LLMs weren't simply handed the source code. Instead, each model ran inside an agentic wrapper that gave it a set of tools: a web browser to inspect pages, a terminal to run curl commands, a Python execution environment, and a proxy to intercept and replay HTTP requests (think of it as a programmatic Burp Suite).

How LLM Security Agents Actually Work

Before we look at the results, let’s demystify how an LLM acts as a "hacker." It doesn't just stare at an input box and guess. It runs in a loop known as the ReAct (Reasoning and Acting) pattern, where each step's output feeds the next decision. Here is a simplified mental model of how this loop operates:


+--------------------------------------------------+
|                  LLM Agent Loop                  |
+--------------------------------------------------+
                         |
                         v
              [ 1. Analyze Current State ]
                         |
                         v
            [ 2. Formulate Hypothesis ]
         (e.g., "The 'id' parameter in /api/user 
                might be vulnerable to IDOR")
                         |
                         v
                [ 3. Choose a Tool ]
         (e.g., HTTP Tool -> Send request with id=2)
                         |
                         v
             [ 4. Parse the Response ]
         (e.g., "Received 200 OK with another user's data")
                         |
                         v
            [ 5. Update State & Iterate ]

When given this toolset, the LLM acts as the "brain," deciding what payload to send next based on the output of the previous command. It's highly dynamic, which makes it vastly different from traditional static vulnerability scanners.
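To make the loop above concrete, here is a minimal sketch in JavaScript. It is not the researcher's harness: `decide` stands in for the LLM call, `tools` for the HTTP and terminal tooling, and `goalReached` for the success check. All three are hypothetical stubs I made up for illustration, and a real agent would be asynchronous and far messier.

```javascript
// Minimal ReAct-style loop. `decide` stands in for the LLM call and `tools`
// for the HTTP/terminal tooling; both are hypothetical stubs, not a real API.
function runAgentLoop({ decide, tools, goalReached, maxSteps = 10 }) {
  const history = []; // observations accumulated so far (the agent's "state")

  for (let step = 0; step < maxSteps; step++) {
    // Steps 1-3: analyze state, form a hypothesis, choose a tool and its input.
    const action = decide(history);
    const tool = tools[action.tool];
    if (!tool) {
      history.push({ action, observation: `unknown tool: ${action.tool}` });
      continue;
    }

    // Step 4: run the tool and record what came back.
    const observation = tool(action.input);
    history.push({ action, observation });

    // Step 5: stop if the goal is met, otherwise iterate with updated state.
    if (goalReached(observation)) {
      return { success: true, steps: step + 1, history };
    }
  }
  return { success: false, steps: maxSteps, history };
}

// Usage with stubs: a fake "model" that tries ids 1, 2, 3... and a fake HTTP tool.
const fakeTools = {
  http: (id) => (id === 3 ? { status: 200, body: "other user's data" } : { status: 403 }),
};
const fakeDecide = (history) => ({ tool: "http", input: history.length + 1 });

const result = runAgentLoop({
  decide: fakeDecide,
  tools: fakeTools,
  goalReached: (observation) => observation.status === 200,
});
console.log(result.success, result.steps); // true 3
```

Note the `maxSteps` cap: without some hard limit, nothing in the loop itself forces the agent to stop.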

The Good: What the LLMs Crushed

With a $1,500 budget, the LLMs managed to pull off some highly impressive exploits that would normally require a skilled human pentester. They excelled particularly in areas requiring semantic understanding and pattern recognition across multiple steps.

1. Multi-Step IDOR Attacks

Traditional vulnerability scanners are notoriously bad at finding IDORs because they don’t understand context. They don't know that /api/v1/billing/invoice/9821 belongs to User A and shouldn't be seen by User B.

The LLM agents excelled here. They would log in as User A, map out all the API endpoints, log in as User B, and systematically replay User A's endpoints using User B's session tokens. When the server responded with User A's private data, the LLM instantly recognized it as a successful exploit.
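The replay check described above can be sketched as a small comparison routine. This is my own illustration, not code from the experiment: `fetchAs(session, path)` is a hypothetical helper that returns `{ status, body }`, the server is stubbed so the example runs offline, and the `/api/v1/profile/7` path is invented for the demo.

```javascript
// Flags endpoints where user B's session receives user A's private data.
// `fetchAs(session, path)` is a hypothetical helper returning { status, body }.
function findIdorCandidates(pathsOwnedByA, fetchAs, sessionA, sessionB) {
  const findings = [];
  for (const path of pathsOwnedByA) {
    const asOwner = fetchAs(sessionA, path);
    const asOther = fetchAs(sessionB, path);
    // A correct server would answer B with 403 or 404. An identical 200 body
    // means B can read A's resource.
    if (asOwner.status === 200 && asOther.status === 200 && asOther.body === asOwner.body) {
      findings.push(path);
    }
  }
  return findings;
}

// Stubbed server: the invoice route checks ownership, the profile route does not.
const stubFetch = (session, path) => {
  if (path === "/api/v1/billing/invoice/9821") {
    return session === "token-A"
      ? { status: 200, body: "invoice-9821" }
      : { status: 403, body: "" };
  }
  return { status: 200, body: "profile-7" }; // no ownership check at all
};

const idorFindings = findIdorCandidates(
  ["/api/v1/billing/invoice/9821", "/api/v1/profile/7"],
  stubFetch,
  "token-A",
  "token-B"
);
console.log(idorFindings); // [ '/api/v1/profile/7' ]
```

A check like this will also flag endpoints that are intentionally public, so every finding still needs a human (or a smarter agent) to confirm that the data was supposed to be private.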

2. Bypassing Naive WAFs (Web Application Firewalls)

When the app blocked basic SQL injection payloads like ' OR 1=1 --, the LLMs didn't give up. They analyzed the error messages and automatically began trying different bypass techniques, such as hex encoding, using SQL comments (/**/) to bypass space-character filtering, or trying alternative SQL dialects. Because they have been trained on vast amounts of historical security write-ups, they have an encyclopedic knowledge of WAF bypasses.
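To see why pattern-based filtering loses this game, here is a small sketch. The filter below is a deliberately naive stand-in I wrote for illustration, not the lab's actual WAF. The second half shows the usual fix: a parameterized query, where the input travels as a bound value and never becomes part of the SQL text. The `{ text, values }` shape follows the convention used by some PostgreSQL clients; treat it as an assumption and check your driver's documentation.

```javascript
// A deliberately naive blocklist filter: rejects input containing " OR "
// surrounded by whitespace. Hypothetical, for illustration only.
function naiveWafAllows(input) {
  return !/\sOR\s/i.test(input);
}

const classicPayload = "' OR 1=1 --";
const commentVariant = "'/**/OR/**/1=1 --"; // SQL comments replace the spaces

console.log(naiveWafAllows(classicPayload)); // false (blocked)
console.log(naiveWafAllows(commentVariant)); // true  (slips through)

// The durable fix: never splice user input into the SQL string.
// The query text is constant; the input is passed separately as a bound value.
function buildLoginQuery(username) {
  return {
    text: "SELECT id FROM users WHERE username = $1",
    values: [username],
  };
}

const query = buildLoginQuery(commentVariant);
console.log(query.text); // SELECT id FROM users WHERE username = $1
```

With the parameterized version it no longer matters which encoding trick the agent tries, because the database treats the whole input as data rather than as SQL.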

3. Exploiting Cryptographic Weaknesses

In one scenario, the application used a poorly implemented custom JSON Web Token (JWT) verification mechanism. The LLM agent successfully retrieved a JWT, decoded it, identified that the signature verification could be bypassed by changing the algorithm header to "none" (the classic "alg": "none" vulnerability), re-encoded the token using a Python script it wrote on the fly, and gained admin access.
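On the defensive side, the fix is to never let the token choose its own algorithm. The sketch below uses only Node's built-in `crypto` module and assumes an HS256-only setup; `signHs256`, `verifyHs256`, and the demo secret are hypothetical names for this example. In a real app you would more likely reach for a maintained JWT library configured with an explicit algorithm allowlist.

```javascript
const crypto = require("crypto");

const b64url = (text) => Buffer.from(text).toString("base64url");

// Helper for the demo: issues an HS256-signed token.
function signHs256(payload, secret) {
  const header = b64url(JSON.stringify({ alg: "HS256", typ: "JWT" }));
  const body = b64url(JSON.stringify(payload));
  const signature = crypto
    .createHmac("sha256", secret)
    .update(`${header}.${body}`)
    .digest("base64url");
  return `${header}.${body}.${signature}`;
}

// Returns the payload if the token is valid, otherwise null.
function verifyHs256(token, secret) {
  const parts = token.split(".");
  if (parts.length !== 3) return null;
  const [header, body, signature] = parts;

  let parsedHeader;
  try {
    parsedHeader = JSON.parse(Buffer.from(header, "base64url").toString("utf8"));
  } catch {
    return null;
  }
  // The server fixes the algorithm; the header is only checked against it.
  if (!parsedHeader || parsedHeader.alg !== "HS256") return null;

  const expected = crypto.createHmac("sha256", secret).update(`${header}.${body}`).digest();
  const given = Buffer.from(signature, "base64url");
  // Length check first, because timingSafeEqual throws on unequal lengths.
  if (given.length !== expected.length || !crypto.timingSafeEqual(given, expected)) {
    return null;
  }

  try {
    return JSON.parse(Buffer.from(body, "base64url").toString("utf8"));
  } catch {
    return null;
  }
}

const secret = "demo-secret-do-not-reuse";
const validToken = signHs256({ sub: "user-b", role: "user" }, secret);

// The forged token from the scenario: alg "none", admin role, empty signature.
const forgedToken =
  `${b64url(JSON.stringify({ alg: "none", typ: "JWT" }))}.` +
  `${b64url(JSON.stringify({ sub: "user-b", role: "admin" }))}.`;

console.log(verifyHs256(validToken, secret)); // { sub: 'user-b', role: 'user' }
console.log(verifyHs256(forgedToken, secret)); // null
```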

The Bad: Where the LLMs Burned Cash (and Failed)

Despite these successes, the experiment revealed some glaring limitations of current state-of-the-art LLMs. This is where the $1,500 budget evaporated into thin air.

1. Hallucination Loops and "Infinite Retries"

LLMs are inherently probabilistic. When they hit a dead end, they don't always stop. In several runs, an agent got stuck in a loop trying to exploit a non-existent vulnerability. It would attempt a payload, receive a 404 Not Found, modify the payload slightly, receive another 404, and repeat this process hundreds of times, burning through API credits at an alarming rate. It lacked the human intuition to say: "Okay, this avenue is clearly a dead end; let me look elsewhere."
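If you build your own agent harness, a cheap guard outside the model can supply the "give up" instinct it lacks. The following is a sketch under my own assumptions (the class name, thresholds, and verdict strings are all invented, and the write-up doesn't say whether the experiment used anything similar): it counts consecutive 404s per target and enforces a hard cap on total requests.

```javascript
// Tracks consecutive 404s per target plus a global request budget.
// All names and thresholds here are hypothetical defaults.
class DeadEndGuard {
  constructor({ maxConsecutiveFailures = 5, maxTotalRequests = 200 } = {}) {
    this.maxConsecutiveFailures = maxConsecutiveFailures;
    this.maxTotalRequests = maxTotalRequests;
    this.failures = new Map(); // target -> consecutive 404 count
    this.totalRequests = 0;
  }

  // Call after every request. Returns "continue", "abandon-target", or "stop-run".
  record(target, status) {
    this.totalRequests += 1;
    if (this.totalRequests >= this.maxTotalRequests) return "stop-run";

    if (status === 404) {
      const count = (this.failures.get(target) || 0) + 1;
      this.failures.set(target, count);
      if (count >= this.maxConsecutiveFailures) return "abandon-target";
    } else {
      this.failures.set(target, 0); // any other response resets the streak
    }
    return "continue";
  }
}

const guard = new DeadEndGuard({ maxConsecutiveFailures: 3 });
const verdicts = [404, 404, 404].map((status) => guard.record("/api/ghost", status));
console.log(verdicts); // [ 'continue', 'continue', 'abandon-target' ]
```

The harness, not the model, acts on the verdict, so a confused agent can't talk itself into "just one more payload."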

2. State Tracking and Context Window Exhaustion

As the security testing progressed, the "history" of the conversation grew massive. To make decisions, the LLM needs to remember what it did 20 steps ago. Once the context window filled up with raw HTML responses, HTTP headers, and stack traces, the models began to suffer from "loss of focus." They forgot their original objectives, started repeating previously failed tests, or hallucinated that they had already found flags they hadn't actually reached.
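One common mitigation is to compact older observations before they are fed back to the model. The sketch below is an assumption on my part rather than something the experiment is reported to have done: it keeps the most recent entries intact and truncates the bulky bodies of older ones. The `compactHistory` name and its options are hypothetical.

```javascript
// Keeps the last `keepRecent` entries untouched and truncates the
// observation text of older entries to `maxChars` characters.
function compactHistory(history, { keepRecent = 5, maxChars = 200 } = {}) {
  const cutoff = Math.max(0, history.length - keepRecent);
  return history.map((entry, index) => {
    if (index >= cutoff) return entry; // recent entries stay verbatim
    const text = entry.observation;
    if (text.length <= maxChars) return entry;
    return {
      ...entry,
      observation: `${text.slice(0, maxChars)} [truncated ${text.length - maxChars} chars]`,
    };
  });
}

// Four steps, each with a 1,000-character raw HTML response.
const rawHistory = [1, 2, 3, 4].map((step) => ({
  step,
  observation: "x".repeat(1000),
}));

const compacted = compactHistory(rawHistory, { keepRecent: 2, maxChars: 50 });
console.log(compacted.map((entry) => entry.observation.length));
```

Truncation is the bluntest option. Summarizing old steps or storing findings in a separate notes structure would preserve more signal, at the cost of extra model calls.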

3. Complex Business Logic Mysteries

While LLMs are great at structured protocols (like HTTP or JWT), they struggle with highly custom, esoteric business logic. For example, when bypassing a workflow required completing a specific sequence of actions across three different microservices with precise timing, the agents failed to build the mental map necessary to break it.

What This Means for Developers: The Defensive Playbook

If an LLM can write Python scripts to exploit your APIs, what does that mean for us as developers? It means the baseline for "acceptable security" has risen. Script kiddies no longer need to know how to write exploit payloads; they can just point an agent at our endpoints.

Here is how we must adapt our defensive strategies:

1. Stop Relying on Obscurity

If you rely on "unpredictable" URL structures or undocumented API endpoints as a security measure, stop. LLM agents are very good at spidering web applications and guessing likely paths, so assume that an endpoint like /internal-admin-dashboard-v2 will be discovered quickly. A hidden URL is not an access control.

2. Implement Robust Authorization Checks (Not Just Authentication)

Because IDORs are so easy for LLMs to find, you must enforce strict access control at the data layer, not just the routing layer. Take a look at this vulnerable Node.js/Express controller:


// VULNERABLE: Only checks if the user is logged in, not if they own the resource
app.get('/api/documents/:docId', requireAuth, async (req, res) => {
    const document = await Database.getDocument(req.params.docId);
    return res.json(document);
});

An LLM agent will easily exploit this by cycling through :docId values. Instead, you must explicitly validate ownership:


// SECURE: Validates that the logged-in user actually owns the requested resource
app.get('/api/documents/:docId', requireAuth, async (req, res) => {
    const document = await Database.getDocument(req.params.docId);

    // Missing and foreign documents get the same response, so the status code
    // does not reveal which document IDs exist.
    if (!document || document.ownerId !== req.user.id) {
        return res.status(403).json({ error: "Forbidden: you do not have access to this resource" });
    }

    return res.json(document);
});
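To push the same check down into the data layer, one option is to make ownership part of the lookup itself, so an unscoped read can't be written by accident. Here is a minimal sketch using an in-memory array in place of a real database; `DocumentRepository` and `findOwned` are hypothetical names, and with a real database the owner ID would become part of the query's WHERE clause.

```javascript
// In-memory stand-in for a data-access layer. Every lookup requires the
// caller's user id, so there is no way to fetch a document "unscoped".
class DocumentRepository {
  constructor(rows) {
    this.rows = rows;
  }

  // Returns the document only if it exists AND belongs to ownerId.
  findOwned(docId, ownerId) {
    const match = this.rows.find((row) => row.id === docId && row.ownerId === ownerId);
    return match || null;
  }
}

const repository = new DocumentRepository([
  { id: "doc-1", ownerId: "user-a", title: "A's private notes" },
  { id: "doc-2", ownerId: "user-b", title: "B's private notes" },
]);

console.log(repository.findOwned("doc-1", "user-a")); // the document
console.log(repository.findOwned("doc-1", "user-b")); // null
```

The controller then only has to handle "found" versus "not found," and a new route that forgets the ownership check simply has no method to call.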

3. Rate Limit Aggressively

LLMs operating as security agents have to make hundreds, sometimes thousands, of requests to find a crack in the armor. If your API endpoints do not have strict rate limiting, you are giving these agents a free pass to hammer your servers until they find a loophole. Use tools like Redis to implement token-bucket rate limiting on all public-facing endpoints.
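For reference, here is what the token-bucket algorithm itself looks like. This is a single-process, in-memory sketch so the logic is easy to follow. It is not a Redis implementation: in production the bucket state would live in a shared store such as Redis so that all app instances see the same counts. The `TokenBucket` class and its injectable `now` clock are hypothetical names for this example.

```javascript
// In-memory token bucket: holds up to `capacity` tokens and refills at
// `refillPerSecond`. Each request spends one token. `now` is injectable
// so the example can control time.
class TokenBucket {
  constructor({ capacity, refillPerSecond, now = () => Date.now() }) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.now = now;
    this.tokens = capacity; // start full, which allows an initial burst
    this.lastRefill = now();
  }

  // Returns true if the request may proceed, false if it should get a 429.
  tryConsume() {
    const current = this.now();
    const elapsedSeconds = (current - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefill = current;

    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// Usage with a fake clock: a burst of three requests, then one after a second.
let fakeTimeMs = 0;
const bucket = new TokenBucket({ capacity: 2, refillPerSecond: 1, now: () => fakeTimeMs });

const burst = [bucket.tryConsume(), bucket.tryConsume(), bucket.tryConsume()];
fakeTimeMs += 1000;
const afterWait = bucket.tryConsume();

console.log(burst, afterWait); // [ true, true, false ] true
```

In practice you would keep one bucket per client key (API token, user ID, or IP address) rather than a single global one.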

4. Adopt "AI vs. AI" Testing in Your CI/CD Pipeline

If the bad guys (or curious researchers) are using LLM agents to find holes in your app, you should be doing it first. Integrating agentic LLM scanners into your staging environment can help you catch leaks, IDORs, and injection flaws before your code ever hits production. It is much cheaper to spend $10 on API calls during a build step than to deal with a data breach in production.

Conclusion: A Powerful Tool, But No Replacement for Human Ingenuity

This $1,500 experiment is a wake-up call. It suggests that while LLMs aren't magic "god-mode" hacking tools yet, they are rapidly evolving. They are highly capable of automating the tedious, repetitive parts of penetration testing, and they can do it at scale, 24/7, for a fraction of the cost of a human security firm.

However, they still lack the critical thinking, deep system intuition, and creative problem-solving skills of a seasoned human engineer. They get stuck in loops, they hallucinate, and they waste resources on dead ends.

As developers, our job remains the same: write clean, defensive code, validate every single input, assume all client-side data is hostile, and automate our own testing pipelines to stay one step ahead of the machine.

What are your thoughts? Have you experimented with using LLMs to audit your own codebases? Did they catch anything you missed, or did they just hallucinate? Let me know in the comments below!
