We Spent $1,500 Testing if LLMs Can Hack Our Apps. Here’s What We Learned About AI-Driven Penetration Testing.

Hey everyone, Alex here. Welcome back to another edition of Coding with Alex at sysseder.com. If you’ve been anywhere near a terminal or a tech news feed lately, you’ve probably heard the hype: "AI is going to replace security engineers!" or "LLMs can now autonomously hack enterprise networks!"

But how much of this is marketing fluff, and how much of it is a legitimate threat (or asset) to our day-to-day development workflows? To find out, security researchers recently built an intentionally vulnerable application and unleashed state-of-the-art Large Language Models (LLMs) on it, racking up a cool $1,500 API bill in the process.

Today, we’re going to dissect those findings. We will look at what LLMs are actually capable of when it comes to finding and exploiting vulnerabilities, where they fail miserably, and what this means for us as developers trying to defend our codebases. Grab a coffee, and let's dive in.

The Setup: The Vulnerable App Sandbox

To test the capabilities of models like GPT-4, Claude 3.5 Sonnet, and various open-source models, researchers constructed a multi-layered, intentionally vulnerable web application. This wasn't just a simple script with a basic SQL injection; it was a realistic microservices-based environment mimicking a modern SaaS platform. It featured:

A React-based frontend.
An Express/Node.js backend API.
A PostgreSQL database holding mock user data and "secrets".
An isolated Docker environment to prevent any runaway AI scripts from damaging external networks.

The models were given an objective: act as an external penetration tester. They were provided with various starting points, ranging from "black-box" testing (zero prior knowledge of the codebase, just an IP address) to "white-box" testing (access to the source code and API documentation).

How the LLM Attacks Work: The Agentic Loop

Before we look at the results, we need to understand how an LLM actually conducts a hack. A static prompt like "hack this IP address" will almost always result in a generic refusal or a hallucinated explanation.

To make the LLMs effective, researchers built an Agentic Loop. This is a system where the LLM is placed inside a Python execution harness and given "tools" (such as bash commands, curl, nmap, and custom Python scripts). The loop looks like this:


+--------------------------------------------------+
|                                                  |
|                  LLM Brain                       |
|   (Analyzes current state & decides next step)   |
|                                                  |
+------------------------+-------------------------+
                         |
                         | Decides action (e.g., Run nmap)
                         v
+--------------------------------------------------+
|                                                  |
|                 Execution Agent                  |
|       (Runs tool in isolated container)          |
|                                                  |
+------------------------+-------------------------+
                         |
                         | Captures output (stdout/stderr)
                         v
+--------------------------------------------------+
|                                                  |
|                 Feedback Loop                    |
|       (Appends tool output back to LLM context)  |
|                                                  |
+--------------------------------------------------+

In this architecture, the LLM reads the output of its previous action, updates its "mental map" of the target, and decides on the next logical command. If a command fails, the LLM reads the error message, debugs its own payload, and tries again.
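
To make the loop concrete, here is a minimal sketch of it. It is written in JavaScript to match the other examples in this post (the researchers' harness was Python), and it is not their code. The names runAgentLoop, decide, and runTool are hypothetical: decide stands in for the LLM call and runTool for the sandboxed executor, and both are canned stubs so the example runs without a model or a network.

```javascript
// Sketch of an agentic loop. `decide` stands in for the LLM call and
// `runTool` for the sandboxed executor; both are hypothetical stubs.
function runAgentLoop({ objective, decide, runTool, maxSteps = 10 }) {
  // The history is the agent's "memory": it is re-sent to the model every step.
  const history = [{ role: 'system', content: `Objective: ${objective}` }];

  for (let step = 0; step < maxSteps; step++) {
    // 1. The model reads the full history and picks the next action.
    const action = decide(history);
    if (action.type === 'finish') {
      return { done: true, steps: step, history };
    }

    // 2. The executor runs the chosen tool and captures its output.
    const output = runTool(action.command);

    // 3. The command and its output are appended for the next iteration.
    history.push({ role: 'assistant', content: action.command });
    history.push({ role: 'tool', content: output });
  }

  // Step budget exhausted without the model declaring it is finished.
  return { done: false, steps: maxSteps, history };
}

// Usage with canned stubs (no model, no network):
const script = ['nmap -p 1-1024 target', 'curl http://target/api/v1/health'];
const result = runAgentLoop({
  objective: 'enumerate exposed services',
  decide: (history) => {
    const toolCalls = history.filter((m) => m.role === 'tool').length;
    return toolCalls < script.length
      ? { type: 'tool', command: script[toolCalls] }
      : { type: 'finish' };
  },
  runTool: (command) => `simulated output for: ${command}`,
});

console.log(result.done, result.steps, result.history.length); // true 2 5
```

Note the maxSteps cap: without a hard step budget, a confused agent will happily keep looping (and billing) forever.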

Where the LLMs Succeeded: Low-Hanging Fruit and Scripting

With $1,500 worth of API tokens spent, the results were eye-opening. The LLMs proved effective at certain types of security tasks, often outpacing junior human testers.

1. Identifying Classic OWASP Top 10 Vulnerabilities

When given access to the source code (white-box testing), models like Claude 3.5 Sonnet and GPT-4 were frighteningly fast at spotting traditional vulnerabilities. For example, consider this vulnerable Node.js endpoint:

// A vulnerable search endpoint in our Express app
app.get('/api/v1/users/search', async (req, res) => {
    const { username } = req.query;
    
    // VULNERABILITY: Direct string interpolation leading to SQL Injection
    const query = `SELECT id, username, email FROM users WHERE username = '${username}'`;
    
    try {
        const result = await db.query(query);
        res.json(result.rows);
    } catch (err) {
        res.status(500).send("Database error");
    }
});

When the LLM agent parsed this file, it immediately identified the SQL injection. It didn't just point it out, though; it used its execution agent to construct a precise curl command to exploit it, dump the database schema, and retrieve administrative credentials.
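
To see why that interpolation is dangerous, it helps to print the query string the database actually receives. The sketch below reuses the endpoint's template literal in a hypothetical helper, buildVulnerableQuery, and feeds it a textbook payload. The source doesn't record the exact payload the agent used, so treat this one as purely illustrative.

```javascript
// Mirrors the string interpolation used by the vulnerable endpoint.
function buildVulnerableQuery(username) {
  return `SELECT id, username, email FROM users WHERE username = '${username}'`;
}

const benign = buildVulnerableQuery('alice');
const injected = buildVulnerableQuery("' OR '1'='1");

console.log(benign);
// SELECT id, username, email FROM users WHERE username = 'alice'
console.log(injected);
// SELECT id, username, email FROM users WHERE username = '' OR '1'='1'
```

In the second query the attacker's quote closes the string early, and the OR clause is always true, so the WHERE filter matches every row.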

2. Rapid Scripting and Payload Modification

Where LLMs truly shine is their ability to write and modify exploit scripts on the fly. If a generic SQL injection payload failed because of a basic Web Application Firewall (WAF) rule filtering out the word UNION, the LLM could instantly rewrite the payload to use alternative encoding techniques (like Hex encoding or mixed-case obfuscation) based on the error messages returned by the server.
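
From the defender's side, this is a reminder of how brittle keyword blocklists are. Here is a minimal sketch, assuming a deliberately naive rule that blocks the literal word UNION. Both filter functions are hypothetical and are not modeled on any real WAF product.

```javascript
// A hypothetical, deliberately naive rule: block the literal word UNION.
function naiveFilterBlocks(input) {
  return input.includes('UNION');
}

// A slightly less brittle rule: match the keyword regardless of case.
// It is still a blocklist, so it is a speed bump rather than a fix.
function caseInsensitiveFilterBlocks(input) {
  return /\bunion\b/i.test(input);
}

const original = "' UNION SELECT 1 --";
const mixedCase = "' UnIoN SELECT 1 --";

console.log(naiveFilterBlocks(original));            // true
console.log(naiveFilterBlocks(mixedCase));           // false
console.log(caseInsensitiveFilterBlocks(mixedCase)); // true
```

Even the second rule only handles one obfuscation trick. The durable fix is to keep user input out of the query text entirely, which is what parameterized queries do.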

3. Explaining Complex Logic Flaws

In one test case, the target app had an Insecure Direct Object Reference (IDOR) vulnerability where a user could view another user's private invoices by changing a sequential ID in the URL (e.g., /api/invoices/1001 to /api/invoices/1002). The LLM successfully deduced that the IDs were sequential and wrote a multithreaded Python script to scrape all invoices in under two minutes.
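
The root cause of an IDOR is a lookup that trusts the ID in the URL without checking who is asking. Here is a minimal sketch of the difference, using a hypothetical in-memory Map in place of the invoices table. The function names and the ownerId field are my assumptions, not details from the test app.

```javascript
// Hypothetical in-memory data standing in for the invoices table.
const invoices = new Map([
  [1001, { id: 1001, ownerId: 'user-a', total: 120 }],
  [1002, { id: 1002, ownerId: 'user-b', total: 480 }],
]);

// Vulnerable: any authenticated caller gets any invoice by ID.
function getInvoiceInsecure(invoiceId) {
  return invoices.get(invoiceId) ?? null;
}

// Safer: the lookup also checks that the caller owns the record.
// Returning null for both "missing" and "not yours" avoids confirming
// which IDs exist.
function getInvoiceForUser(invoiceId, currentUserId) {
  const invoice = invoices.get(invoiceId);
  if (!invoice || invoice.ownerId !== currentUserId) return null;
  return invoice;
}

console.log(getInvoiceInsecure(1002) !== null);       // true: anyone can read user-b's invoice
console.log(getInvoiceForUser(1002, 'user-a'));       // null
console.log(getInvoiceForUser(1001, 'user-a').total); // 120
```

Switching to random, non-sequential IDs makes enumeration harder, but only the ownership check actually closes the hole.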

Where the LLMs Failed: Context Drift and Logic Traps

Despite these successes, the $1,500 experiment proved that we aren't at the point of fully autonomous AI hackers just yet. The LLMs hit several hard walls that highlight the limitations of current transformer models.

1. The "Token Exhaustion" and Context Drift Problem

As the agentic loop continues, the conversation history grows. Every tool execution, terminal output, and intermediate file gets appended to the LLM's prompt context.

Once the context window gets too large (or as the agent reaches its token limit), the model begins to suffer from "context drift." It forgets its original objective, starts repeating previously failed commands, or gets stuck in infinite loops. This is where a large chunk of that $1,500 budget went—paying for the LLM to run the same failed nmap scan over and over again because it lost track of the historical output.
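
Two cheap guardrails a harness could add here are a bounded history that always keeps the objective pinned, and a check for commands the agent keeps retrying. The sketch below is an assumption about how one might do this, not something the researchers describe. Both trimHistory and isRepeatedCommand are hypothetical helpers.

```javascript
// Keep the pinned objective (first entry) plus only the most recent entries.
function trimHistory(history, maxRecent) {
  const [objective, ...rest] = history;
  const recent = maxRecent > 0 ? rest.slice(-maxRecent) : [];
  return [objective, ...recent];
}

// Report whether a command has already been issued `maxRepeats` times,
// so the harness can refuse (or flag) yet another retry.
function isRepeatedCommand(commandLog, command, maxRepeats) {
  const seen = commandLog.filter((c) => c === command).length;
  return seen >= maxRepeats;
}

const history = ['OBJECTIVE: audit target', 'out-1', 'out-2', 'out-3', 'out-4'];
console.log(trimHistory(history, 2));
// [ 'OBJECTIVE: audit target', 'out-3', 'out-4' ]

const commandLog = ['nmap target', 'curl target', 'nmap target'];
console.log(isRepeatedCommand(commandLog, 'nmap target', 2)); // true
console.log(isRepeatedCommand(commandLog, 'curl target', 2)); // false
```

Dropping old output has its own cost: the agent can forget a finding it still needs. In practice you would probably summarize older entries rather than simply discard them.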

2. Lack of Intuitive "Leaps"

Human penetration testers rely heavily on intuition and "gut feeling" based on subtle clues. For example, a human might notice that a server response time is slightly slower when a specific character is input, hinting at a time-based blind SQL injection.

LLMs struggle with this. They are statistical next-token predictors. If a vulnerability requires a highly creative, multi-step logical bypass that isn't well-documented in their training data, they will usually fail to find it, opting instead to try standard, templated exploits.
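
The timing clue from the example above boils down to comparing two sets of response times. Here is a minimal sketch of that comparison. The timings are made up, and the 100 ms threshold is an arbitrary assumption that a real test would tune against network jitter.

```javascript
// Median is less sensitive to one-off network spikes than the mean.
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0
    ? (sorted[mid - 1] + sorted[mid]) / 2
    : sorted[mid];
}

// Flags a probe as suspicious when its median latency exceeds the
// baseline median by more than `thresholdMs` (an arbitrary cutoff).
function looksSlower(baselineMs, probeMs, thresholdMs) {
  return median(probeMs) - median(baselineMs) > thresholdMs;
}

const baseline = [42, 40, 45, 41, 43];       // made-up timings in ms
const suspicious = [44, 300, 310, 305, 298]; // made-up timings in ms
const ordinary = [41, 44, 43, 40, 46];       // made-up timings in ms

console.log(looksSlower(baseline, suspicious, 100)); // true
console.log(looksSlower(baseline, ordinary, 100));   // false
```

The arithmetic is trivial. What the human brings is the hunch to measure it in the first place.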

The Developer’s Playbook: Defending Against AI-Driven Attacks

What does this mean for us as developers? If malicious actors can spend $1,500 to run automated, highly targeted AI attacks against our public endpoints, we need to adapt our defensive strategies.

1. Kill the Low-Hanging Fruit

Because LLMs are incredibly good at finding classic bugs (SQLi, XSS, Path Traversal), we must ensure these do not exist in our production code. Run static application security testing (SAST) tools in your CI/CD pipelines to catch these before an LLM does. For example, replace raw database queries with parameterized queries or ORMs:

// SECURE: Using parameterized queries
app.get('/api/v1/users/search', async (req, res) => {
    const { username } = req.query;
    
    // The database driver treats input as data, not executable code
    const query = `SELECT id, username, email FROM users WHERE username = $1`;
    
    try {
        const result = await db.query(query, [username]);
        res.json(result.rows);
    } catch (err) {
        res.status(500).send("Database error");
    }
});
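
To give a feel for what a SAST rule looks for, here is a toy check that flags SQL-looking template literals containing ${...} interpolation. It is a rough sketch under heavy assumptions: real tools parse the syntax tree and track data flow, while this single regex only sees one line at a time and will produce both misses and false positives. The findInterpolatedSql helper is hypothetical.

```javascript
// Toy rule: a template literal that starts with a SQL verb and contains
// a `${...}` interpolation somewhere before its closing backtick.
const SQL_INTERPOLATION = /`\s*(SELECT|INSERT|UPDATE|DELETE)\b[^`]*\$\{[^`]*`/i;

// Returns the 1-based line numbers (with text) that match the toy rule.
function findInterpolatedSql(source) {
  return source
    .split('\n')
    .map((text, index) => ({ line: index + 1, text }))
    .filter(({ text }) => SQL_INTERPOLATION.test(text));
}

const sample = [
  "const a = `SELECT id FROM users WHERE username = '${username}'`;",
  'const b = `SELECT id FROM users WHERE username = $1`;',
].join('\n');

console.log(findInterpolatedSql(sample).map((f) => f.line)); // [ 1 ]
```

The vulnerable pattern from earlier is flagged, while the parameterized version with $1 passes.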

2. Implement Aggressive Rate Limiting

An AI agent relies on speed and high-frequency trial and error, and it can fire hundreds of requests per minute to test different payloads. Robust rate limiting and behavior-based blocking can slow an agent down, and often stop it, before it figures out the right payload.
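
Here is a minimal sketch of the idea: a sliding-window limiter keyed by client (for example an IP address or API token). The createRateLimiter name and its options are hypothetical, and the timestamps are passed in explicitly so the example is deterministic. In a real Express app you would more likely reach for a maintained middleware than roll your own.

```javascript
// Sliding-window limiter: allow at most `limit` requests per `windowMs`
// for each client key.
function createRateLimiter({ limit, windowMs }) {
  const hits = new Map(); // key -> timestamps of recently accepted requests

  return function allow(key, now = Date.now()) {
    // Drop timestamps that have aged out of the window.
    const recent = (hits.get(key) ?? []).filter((t) => now - t < windowMs);
    if (recent.length >= limit) {
      hits.set(key, recent);
      return false; // over budget: the caller would respond with HTTP 429
    }
    recent.push(now);
    hits.set(key, recent);
    return true;
  };
}

// Three requests per minute; the fourth is rejected, and the budget
// recovers once the earliest requests age out of the window.
const allow = createRateLimiter({ limit: 3, windowMs: 60_000 });
const results = [0, 1000, 2000, 3000, 61_000].map((t) => allow('203.0.113.7', t));
console.log(results); // [ true, true, true, false, true ]
```

This sketch keeps everything in one process's memory and never evicts idle keys, so it is only meant to illustrate the mechanism. A production setup would typically share counters across instances.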

3. Use AI to Defend Against AI

If offensive security teams are using LLMs to find vulnerabilities, we should be using them in our pull request workflows to find them first. Integrating LLM-based code reviewers into GitHub Actions can help flag potential logic flaws during the code review stage, long before the code is merged and deployed.

Conclusion

The $1,500 experiment shows that LLMs are no longer just coding assistants; they can operate as basic, highly persistent offensive security tools. They still struggle with intuitive logic leaps and context-window limits, but their ability to automate the reconnaissance and exploit-generation phases of an attack is impressive.

As developers, our best defense is to write clean, predictable code, use robust CI/CD security tooling, and make sure we aren't leaving the front door open with simple vulnerabilities that an AI agent can find and exploit in minutes.

What are your thoughts? Have you experimented with using LLMs for security auditing or pen-testing? Let me know in the comments below, or drop your thoughts in our community forum!

Until next time, keep your code clean and your inputs sanitized.

— Alex