We’ve all seen the flashy marketing copy from AI startups promising that autonomous agents are about to replace human penetration testers. If you’ve spent any time working in security or software engineering, your inner skeptic probably kicked in immediately. Can an LLM actually reason its way through a complex, multi-step vulnerability, or is it just a glorified pattern-matcher throwing basic SQL injection payloads at a wall to see what sticks?
Recently, an eye-opening experiment made waves in the developer community. A security researcher built a custom, intentionally vulnerable web application and spent over $1,500 in API credits testing various LLMs—including GPT-4o and Claude 3.5 Sonnet—to see if they could autonomously hack it.
The results were both fascinating and highly instructive for those of us writing code every day. Today, we’re going to deconstruct how these LLM "hackers" actually perform, where they shine, where they completely fall on their face, and what this means for how we secure our applications in the age of AI.
The Setup: The Vulnerable App vs. The Bots
To understand how well these models perform, we first have to look at the playing field. The target wasn't a standard, off-the-shelf vulnerable sandbox like OWASP Juice Shop (which LLMs have very likely seen, walkthroughs included, during training). Instead, it was a custom-built, multi-tenant B2B SaaS application designed to mimic real-world software architecture.
It featured:
- An Angular-based frontend.
- A Node.js/Express backend API.
- A PostgreSQL database.
- Real-world logic, including JWT-based authentication, subscription tiers, tenant isolation, and administrative panels.
The application was seeded with several classic and modern vulnerabilities, ranging from straightforward Input Validation flaws to complex, multi-step Insecure Direct Object References (IDOR) and privilege escalation pathways.
To give the LLMs a fighting chance, they weren't just given a chat prompt. They were wrapped in an autonomous agentic framework (using tools like LangChain and custom Python scripts) that allowed them to interact with a headless browser, execute terminal commands, analyze HTTP responses, and write their own exploit scripts.
The Agent's Loop
The basic loop of the LLM hacking agent looked something like this:
+--------------------------------------------------+
|                    LLM Brain                     |
|   (Analyzes state, decides next logical step)    |
+------------------------+-------------------------+
                         |
                         | Decides action
                         v
+------------------------+-------------------------+
|                  Tool Execution                  |
|  - Executing curl commands                       |
|  - Inspecting HTML/JS with Playwright            |
|  - Writing & running local Python scripts        |
+------------------------+-------------------------+
                         |
                         | Returns stdout/HTTP response
                         v
+------------------------+-------------------------+
|                Observation Parser                |
|  - Sanitizes response for context window limits  |
+--------------------------------------------------+
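To make the loop concrete, here is a minimal sketch in JavaScript. It is not the researcher's framework: decideNextAction, runTool, and parseObservation are hypothetical names, the "LLM" is a stub that returns canned actions, and the 200-character cap is an assumed stand-in for real context-window management.

```javascript
// Hypothetical sketch of the decide -> execute -> observe loop.
// None of these names come from the original experiment.

const MAX_OBSERVATION_CHARS = 200; // assumed cap to protect the context window

// Stub "LLM brain": picks the next action from the history so far.
function decideNextAction(history) {
  if (history.length === 0) return { tool: 'curl', input: '/api/v1/products' };
  if (history.length === 1) return { tool: 'inspect', input: '/main.js' };
  return { tool: 'stop', input: '' };
}

// Stub tool runner: a real agent would shell out or drive a browser here.
function runTool(action) {
  return `output of ${action.tool} on ${action.input} ` + 'x'.repeat(500);
}

// Observation parser: truncate verbose output before it goes back to the model.
function parseObservation(raw) {
  return raw.length > MAX_OBSERVATION_CHARS
    ? raw.slice(0, MAX_OBSERVATION_CHARS) + ' [truncated]'
    : raw;
}

// Run the loop until the stub decides to stop or the step budget runs out.
function runAgent(maxSteps) {
  const history = [];
  for (let step = 0; step < maxSteps; step++) {
    const action = decideNextAction(history);
    if (action.tool === 'stop') break;
    const observation = parseObservation(runTool(action));
    history.push({ action, observation });
  }
  return history;
}

const agentHistory = runAgent(10);
console.log(agentHistory.length); // 2
```

Everything interesting in a real agent happens inside decideNextAction. The loop itself is trivial, which is part of why these frameworks are so easy to build.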
Where LLMs Succeeded (The Scarily Fast Wins)
When it came to low-hanging fruit and single-step vulnerabilities, the LLMs performed shockingly well, often identifying and exploiting bugs faster than a human analyst could boot up Burp Suite.
1. Identifying Exposed Client-Side Logic
One of the first things the agents did was inspect the compiled Angular JavaScript bundles. Because LLMs are incredibly good at parsing and summarizing text, they quickly scanned thousands of lines of minified code, identified API endpoints, and noticed an undocumented administrative route: /api/v1/admin/debug-stats.
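You can see roughly what the agent saw by running a similar scan against your own bundle. The sketch below is an assumption about the general approach, not the agent's actual method: the bundle text is a made-up sample, and extractApiPaths is a hypothetical helper that pulls string literals starting with /api/.

```javascript
// Hypothetical sample of minified bundle text; the real bundle is not shown in the article.
const bundleSample =
  'var a=h.get("/api/v1/products?search="+q);' +
  'function d(){return h.get("/api/v1/admin/debug-stats")}' +
  "h.post('/api/v1/login',c);";

// Collect unique string literals that look like API paths.
function extractApiPaths(source) {
  const pattern = /["'`](\/api\/[A-Za-z0-9_\-\/]+)/g;
  const found = new Set();
  for (const match of source.matchAll(pattern)) {
    found.add(match[1]);
  }
  return [...found].sort();
}

const apiPaths = extractApiPaths(bundleSample);
console.log(apiPaths);
// [ '/api/v1/admin/debug-stats', '/api/v1/login', '/api/v1/products' ]
```

If a route shows up in this list, treat it as public knowledge. Hiding an admin endpoint from the UI does nothing if the path ships in the bundle.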
2. Exploiting SQL Injection (SQLi)
In one of the search endpoints, the application used raw string concatenation for a database query rather than parameterized queries (a classic developer mistake when rushing a feature to production):
// The vulnerable backend code
const query = `SELECT * FROM products WHERE name LIKE '%${req.query.search}%'`;
const results = await db.query(query);
Once the LLM agent discovered this endpoint, it didn't just try a generic ' OR '1'='1. It analyzed the error responses, realized it was dealing with PostgreSQL, and systematically constructed a UNION-based SQL injection attack payload to exfiltrate the schema and database version. It did this in under two minutes, writing a custom Python script to automate the extraction of table names.
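To see why concatenation is the root of the problem, it helps to print the string the database actually receives. This is a sketch that reuses the template from the vulnerable handler; buildSearchQuery is a hypothetical wrapper, and the hostile input is purely illustrative (a working payload would also have to match the column count of the original SELECT).

```javascript
// Same string-building as the vulnerable handler, isolated so it can be printed.
function buildSearchQuery(search) {
  return `SELECT * FROM products WHERE name LIKE '%${search}%'`;
}

console.log(buildSearchQuery('lamp'));
// SELECT * FROM products WHERE name LIKE '%lamp%'

// Illustrative input only: the quote closes the string literal early, so the
// rest of the input is parsed as SQL instead of data.
const hostileInput = "x' UNION SELECT version() --";
console.log(buildSearchQuery(hostileInput));
// SELECT * FROM products WHERE name LIKE '%x' UNION SELECT version() --%'
```

The second query is no longer a search at all. The user-supplied text has become part of the statement's structure, which is exactly what a parameterized query prevents.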
3. Exploiting JSON Web Tokens (JWT)
The app featured a common configuration mistake: the backend accepted JWTs that declared the "none" algorithm, which means the token carries no signature at all and a client can modify the payload without failing cryptographic verification. The Claude 3.5 Sonnet agent spotted the "alg": "HS256" header in a captured token, recalled its training data on JWT security, changed the header to "alg": "none", stripped the signature, and successfully impersonated another user.
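A JWT header is just base64url-encoded JSON, so spotting an unsigned token in logs or tests is straightforward. The sketch below uses only Node's Buffer (the 'base64url' encoding assumes a reasonably recent Node.js). readJwtHeader and isUnsignedToken are hypothetical helpers for inspection only; they do not verify anything and are no substitute for a real JWT library.

```javascript
// Build sample tokens locally so the example needs no secret or library.
// The payloads and signature here are placeholders, not real credentials.
const toBase64Url = (obj) => Buffer.from(JSON.stringify(obj)).toString('base64url');

const signedLooking =
  `${toBase64Url({ alg: 'HS256', typ: 'JWT' })}.${toBase64Url({ sub: 'user-1' })}.placeholder-signature`;
const unsigned =
  `${toBase64Url({ alg: 'none', typ: 'JWT' })}.${toBase64Url({ sub: 'user-2' })}.`;

// Decode only the header segment; this does NOT verify the token.
function readJwtHeader(token) {
  const [headerSegment] = token.split('.');
  return JSON.parse(Buffer.from(headerSegment, 'base64url').toString('utf8'));
}

// True when the token declares no algorithm or carries an empty signature segment.
function isUnsignedToken(token) {
  const parts = token.split('.');
  const alg = String(readJwtHeader(token).alg || '').toLowerCase();
  return alg === 'none' || parts.length !== 3 || parts[2] === '';
}

console.log(readJwtHeader(signedLooking).alg); // HS256
console.log(isUnsignedToken(signedLooking));   // false
console.log(isUnsignedToken(unsigned));        // true
```

A check like this is handy in a regression test that asserts your API rejects such tokens. The actual rejection should still come from your verification library.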
Where LLMs Failed (The "Dumb" Traps)
Despite spending over $1,500 on API calls, the LLMs hit massive roadblocks when faced with complex, multi-step logic and state tracking. This is where the limitations of current LLM architecture become glaringly obvious.
1. The Context Window "Memory Loss"
As the agents interacted with the site, running commands and receiving verbose HTTP responses, their context windows quickly filled up. Even with models boasting 128k or 200k tokens, the "noise" of raw HTML and API responses diluted their focus.
An agent would find a clue (e.g., a hidden API parameter), get distracted by a different endpoint, and completely "forget" about the clue it found 20 steps earlier. It lacked the persistent, structured mental map that a human pentester maintains in their notes.
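One plausible mitigation, and this is my own sketch rather than anything the experiment used, is to keep findings in a structured log outside the context window and feed a short summary of open items back into every prompt. FindingsLog and its methods are hypothetical names.

```javascript
// Hypothetical scratchpad that lives outside the model's context window.
class FindingsLog {
  constructor() {
    this.findings = [];
  }

  // Record a clue along with the step at which it was seen.
  add(step, endpoint, note) {
    this.findings.push({ step, endpoint, note, followedUp: false });
  }

  // Mark every clue for an endpoint as handled.
  markFollowedUp(endpoint) {
    for (const finding of this.findings) {
      if (finding.endpoint === endpoint) finding.followedUp = true;
    }
  }

  // Short summary of clues nobody has acted on yet, suitable for a prompt.
  openItems() {
    return this.findings
      .filter((finding) => !finding.followedUp)
      .map((finding) => `step ${finding.step}: ${finding.endpoint} (${finding.note})`);
  }
}

const findingsLog = new FindingsLog();
findingsLog.add(3, '/api/v1/admin/debug-stats', 'undocumented route in bundle');
findingsLog.add(7, '/api/v1/products', 'search param appears in SQL error');
findingsLog.markFollowedUp('/api/v1/products');
console.log(findingsLog.openItems());
// [ 'step 3: /api/v1/admin/debug-stats (undocumented route in bundle)' ]
```

This is essentially the pentester's notebook in code form: a few lines per clue instead of kilobytes of raw HTML.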
2. Hallucinated Exploits and Infinite Loops
Perhaps the most expensive failure mode was the "hallucination loop." When faced with a patched or secure endpoint, the LLMs would often convince themselves that a vulnerability existed anyway.
For example, if an endpoint returned a 403 Forbidden, the agent would assume it just needed to tweak its bypass header. It would spend hundreds of requests—and dozens of dollars—iterating on increasingly bizarre, hallucinated HTTP headers (like X-Custom-Override-For-Real: 127.0.0.1) instead of realizing the endpoint was simply secure and moving on.
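A human tester applies an informal budget here: after a handful of identical denials, move on. A rough sketch of that rule follows. AttemptBudget is a hypothetical helper, and the threshold of three is an arbitrary assumption, not a figure from the experiment.

```javascript
// Hypothetical guard: cap how many times in a row an agent may hit one endpoint
// while it keeps answering 401 or 403.
class AttemptBudget {
  constructor(maxDenied) {
    this.maxDenied = maxDenied;
    this.deniedCounts = new Map();
  }

  // Record a response; returns false once the endpoint should be abandoned.
  record(endpoint, statusCode) {
    if (statusCode !== 401 && statusCode !== 403) {
      this.deniedCounts.delete(endpoint); // something changed, reset the counter
      return true;
    }
    const count = (this.deniedCounts.get(endpoint) || 0) + 1;
    this.deniedCounts.set(endpoint, count);
    return count < this.maxDenied;
  }
}

const budget = new AttemptBudget(3);
console.log(budget.record('/api/v1/admin/users', 403)); // true
console.log(budget.record('/api/v1/admin/users', 403)); // true
console.log(budget.record('/api/v1/admin/users', 403)); // false, time to move on
```

The point is that the stopping rule lives in ordinary code, outside the model, so it cannot be talked out of it.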
3. Blind Spots in Complex Logic (IDORs)
Consider a classic Insecure Direct Object Reference (IDOR). To exploit an IDOR, you often need to:
- Register Account A and Account B.
- Obtain a resource ID belonging to Account B.
- Send a request from Account A’s session attempting to modify Account B’s resource.
This requires holding two distinct states in memory, understanding the business logic of "ownership," and orchestrating a multi-session attack. The LLM agents struggled immensely with this. They frequently mixed up the session cookies of Account A and Account B, executing actions from the wrong context and concluding that the vulnerability didn't exist.
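On the defensive side, the fix for an IDOR is an ownership check that depends on the session, never on the ID in the request alone. Here is a minimal sketch with in-memory data. The invoice resource, the tenantId and ownerId fields, and canModifyInvoice are all assumptions for illustration; the article does not show the app's real data model.

```javascript
// Hypothetical in-memory data; field names (tenantId, ownerId) are assumptions.
const invoices = new Map([
  ['inv-1', { id: 'inv-1', tenantId: 'tenant-a', ownerId: 'user-a' }],
  ['inv-2', { id: 'inv-2', tenantId: 'tenant-b', ownerId: 'user-b' }],
]);

// Authorization decision based on who the session belongs to,
// not on whether the caller happens to know a valid ID.
function canModifyInvoice(sessionUser, invoiceId) {
  const invoice = invoices.get(invoiceId);
  if (!invoice) return false;
  return invoice.tenantId === sessionUser.tenantId && invoice.ownerId === sessionUser.id;
}

const accountA = { id: 'user-a', tenantId: 'tenant-a' };
const accountB = { id: 'user-b', tenantId: 'tenant-b' };

console.log(canModifyInvoice(accountA, 'inv-1')); // true
console.log(canModifyInvoice(accountA, 'inv-2')); // false, belongs to Account B
console.log(canModifyInvoice(accountB, 'inv-2')); // true
```

The three calls at the bottom mirror the two-account test described above, and they make a good regression test for every endpoint that takes a resource ID.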
Key Takeaways: How to Protect Your Codebase
If this experiment shows one thing, it’s that LLMs aren’t replacing senior pentesters tomorrow, but they are lowering the barrier for script kiddies to find common vulnerabilities. An automated attacker can probe your public-facing APIs around the clock for a fraction of what a human tester would cost.
As developers, we need to adapt our coding practices to mitigate these fast-moving, automated threats.
1. Parameterize Every Query, No Exceptions
Since LLMs can write custom SQLi scripts in minutes, manual string formatting of SQL must be banned from your codebase. Use an ORM or strict parameterized queries:
// The Secure Way
const query = 'SELECT * FROM products WHERE name LIKE $1';
const results = await db.query(query, [`%${req.query.search}%`]);
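One detail worth knowing: parameterization stops injection, but % and _ inside the user's input still act as LIKE wildcards. If you want the search term matched literally, escape them first. The helper below is a sketch that assumes PostgreSQL's default LIKE escape character (the backslash); escapeLike is a hypothetical name.

```javascript
// Escape LIKE metacharacters so user input is matched literally.
// Assumes PostgreSQL's default LIKE escape character, the backslash.
function escapeLike(input) {
  return input.replace(/[\\%_]/g, (ch) => `\\${ch}`);
}

const searchTerm = '100%_off';
const likeParam = `%${escapeLike(searchTerm)}%`;
console.log(likeParam); // %100\%\_off%
```

You would then pass likeParam as the $1 value in place of the raw template string.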
2. Adopt Strict JWT Validation Libraries
Never write custom JWT parsing logic. Ensure your JWT verification library explicitly restricts the allowed algorithms, completely blocking the none algorithm exploit path:
// Secure JWT verification in Express
const jwt = require('jsonwebtoken');
const verifyToken = (req, res, next) => {
  const token = req.headers['authorization']?.split(' ')[1];
  // Explicitly enforce HS256 and reject "none"
  jwt.verify(token, process.env.JWT_SECRET, { algorithms: ['HS256'] }, (err, decoded) => {
    if (err) return res.sendStatus(403);
    req.user = decoded;
    next();
  });
};
3. Implement Aggressive Rate Limiting
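The LLM agent in this experiment made thousands of rapid-fire requests to map out endpoints and guess payloads. A simple rate-limiting policy on your API routes would likely have throttled the agent's exploratory phase early on, making the automated attack far slower and more expensive. Keep in mind that per-IP limits can be sidestepped by rotating addresses, so treat this as one layer of defense rather than a complete fix.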
// Using express-rate-limit
const rateLimit = require('express-rate-limit');
const apiLimiter = rateLimit({
  windowMs: 15 * 60 * 1000, // 15 minutes
  max: 100, // Limit each IP to 100 requests per window
  message: 'Too many requests from this IP, please try again later.'
});
app.use('/api/', apiLimiter);
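If you’re curious what a policy like "100 requests per 15 minutes" boils down to, here is a minimal fixed-window counter. It is illustrative only and makes no claim about how express-rate-limit works internally; createFixedWindowLimiter is a hypothetical name, and the clock is injectable so the behavior is easy to test.

```javascript
// Minimal fixed-window counter; illustrative only, not express-rate-limit's implementation.
function createFixedWindowLimiter({ windowMs, max }) {
  const hits = new Map(); // key (e.g. client IP) -> { count, windowStart }

  // Returns true if the request is allowed; `now` is injectable for testing.
  return function allow(key, now = Date.now()) {
    const entry = hits.get(key);
    if (!entry || now - entry.windowStart >= windowMs) {
      hits.set(key, { count: 1, windowStart: now }); // start a fresh window
      return true;
    }
    entry.count += 1;
    return entry.count <= max;
  };
}

const allowRequest = createFixedWindowLimiter({ windowMs: 1000, max: 2 });
console.log(allowRequest('203.0.113.5', 0));    // true
console.log(allowRequest('203.0.113.5', 10));   // true
console.log(allowRequest('203.0.113.5', 20));   // false, over the limit
console.log(allowRequest('203.0.113.5', 1000)); // true, new window
```

An agent that needs thousands of requests to map an API gets two per second here, per client key. That changes the economics of the attack considerably.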
Conclusion: The Future of AI and Security
The $1,500 pentesting experiment suggests that we are living in a transitional era. LLMs are not magical hacking deities, but they are powerful force multipliers. They can scan code, generate payload scripts, and execute single-step attacks at a scale and speed that humans simply cannot match.
However, they still lack the multi-step logical reasoning, reliable state tracking, and strategic judgment of a human security engineer. For now, the best defense is robust, secure-by-default coding practices, strict input validation, and rate limiting designed to price out automated brute-force attempts.
Have you experimented with using LLMs to write unit tests or scan your own code for security vulnerabilities? What has your experience been with AI-generated security fixes? Let me know in the comments below!
Until next time, keep your dependencies updated and your inputs sanitized.
— Alex R.