AI Coding Agents Have Become the New Attack Surface

Two months ago I handed Claude Code the keys to a fresh VM and walked away to see what it would break. It broke a few things, and every one of them was mine to break -- my VM, my config, my time. I wrote it up as a story about an agent's limitations, the five places it guessed wrong and kept going. What I didn't write about, because I didn't yet have a clean incident to point at, was the other failure mode. Not the agent doing the wrong thing on its own. The agent doing exactly the right thing, faithfully, while someone other than me was holding the keys.

That incident exists now. Several of them do. And the uncomfortable part is that none of them are stories about a model getting "hacked" in the way people picture it. Nothing here is jailbroken. The model isn't tricked into saying something it shouldn't. The failure is quieter, and harder to fix: untrusted input reaches a toolchain that can act on your behalf -- with your credentials, your shell, your filesystem -- and somewhere in the path the line between data to read and command to run gives way. The job is the exploit.

I want to walk through three of these, because they're not variations on one bug. They're three different layers of the same stack failing for the same architectural reason, and once you see the shape you can't unsee it.

Start with the one that hit closest to home. In May, a supply chain worm that researchers are calling Mini Shai-Hulud, attributed to a group tracked as TeamPCP, tore through the npm and PyPI ecosystems in one coordinated wave and compromised more than 170 packages across both registries, including the TanStack, Mistral AI, and OpenSearch projects. It wasn't this campaign's first wave and it hasn't been its last -- the worm has kept resurfacing in new variants through June -- but the May payload is the one I keep coming back to, because of what it did once it was inside. The supply chain part is a normal-sounding story, the kind I've covered before. Here's the part that stopped me. Once the worm lands in a developer's environment it doesn't just grab credentials and leave. It writes persistence hooks into two specific files: .vscode/tasks.json, using a "runOn": "folderOpen" trigger, and .claude/settings.json, abusing Claude Code's SessionStart hook. Translation -- it re-runs the moment you open the repo in your editor, or the moment you start a session with the agent. And the part that earns the worm its reputation is that this survives the obvious fix. You can pull the poisoned package, clear the npm cache, do everything muscle memory tells you to do, and the hooks are still sitting on disk waiting for the next time you open the folder. This isn't one vendor's reading of the payload, either -- SafeDep, Sonar, and StepSecurity each traced the same two files, Flashpoint flagged the same agent-hijacking pattern, and CyberScoop carried it into the mainstream security press. The analyses that followed the Claude Code hook watched it pull down the Bun runtime to run its credential harvester out of sight of tools that only know to watch Node. The mechanism is the thing. The worm chose the AI agent's config as the place to live. Not the shell profile, not a cron job -- the agent. Because the agent is the process that runs with your tokens, opens your files, and executes commands on a loop, and it starts every time you sit down to work.
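
If you want to check a repo for the two persistence locations described above, here is a minimal Python sketch. It is my own illustration, not a tool from any of the vendor writeups. The function name find_autorun_hooks is hypothetical, and I'm assuming the usual layouts: a "tasks" array with "runOptions" in .vscode/tasks.json, and a "hooks" object keyed by event name in .claude/settings.json. It reports anything that auto-runs, legitimate or not, so treat the output as a list to review rather than a verdict.

```python
import json
from pathlib import Path


def _load(path, findings):
    """Parse a JSON config; record a finding if it exists but will not parse."""
    if not path.is_file():
        return None
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except ValueError:
        # tasks.json may contain comments, which strict JSON rejects.
        findings.append(f"{path.name}: not strict JSON, review by hand")
        return None


def _commands(node):
    """Recursively collect every string stored under a "command" key."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "command" and isinstance(value, str):
                found.append(value)
            else:
                found.extend(_commands(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(_commands(item))
    return found


def find_autorun_hooks(repo):
    """List config entries that execute on folder open or agent session start."""
    repo = Path(repo)
    findings = []

    # Assumed layout: {"tasks": [{"command": ..., "runOptions": {"runOn": ...}}]}
    tasks = _load(repo / ".vscode" / "tasks.json", findings)
    if isinstance(tasks, dict):
        for task in tasks.get("tasks", []):
            if not isinstance(task, dict):
                continue
            options = task.get("runOptions")
            if isinstance(options, dict) and options.get("runOn") == "folderOpen":
                findings.append(
                    f".vscode/tasks.json runs on folderOpen: {task.get('command')!r}"
                )

    # Assumed layout: {"hooks": {"SessionStart": [... {"command": ...} ...]}}
    settings = _load(repo / ".claude" / "settings.json", findings)
    if isinstance(settings, dict):
        hooks = settings.get("hooks")
        if isinstance(hooks, dict) and "SessionStart" in hooks:
            for cmd in _commands(hooks["SessionStart"]):
                findings.append(f".claude/settings.json SessionStart hook: {cmd!r}")

    return findings
```

Calling find_autorun_hooks(".") from the repo root returns a list of strings, one per auto-running entry. An empty list only means these two files look clean under the assumptions above, not that the machine is.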

The payload underneath is exactly what you'd fear: it goes after AWS IAM keys, GitHub personal access tokens, HashiCorp Vault tokens, and Kubernetes secrets. And the way it gained the right to publish poisoned package versions in the first place is its own small horror -- it abused GitHub Actions pull_request_target triggers and extracted OIDC tokens to mint valid publish credentials, which let the malicious releases ship with cryptographically valid provenance attestations, the kind several writeups described as SLSA Build Level 3. It's tempting to call that forgery, and forgery is the wrong word, which is exactly what makes it worse. The attestations weren't faked. The worm pulled the legitimate OIDC token out of the CI runner's memory and signed through Sigstore the same way the real build does, producing attestations indistinguishable from genuine ones. The cryptography verified because there was nothing wrong with the cryptography. And there's a sharper twist the OpenSSF maintainers pointed out afterward: the build platform that produced these never actually met SLSA Build Level 3's isolation requirements, and one that did would have blocked the token theft that started the whole thing. So the attestation didn't just certify a compromised pipeline -- it advertised a level of assurance the pipeline was never delivering. What failed was the assumption sitting under all of it: that a valid provenance attestation means the package was built by someone you should trust. Provenance can prove which pipeline built a package. It was never able to prove that the pipeline wasn't already owned.

Now the second layer, and this is the one I think about most because it's the cleanest demonstration of the underlying problem. CVE-2026-22708, a vulnerability in Cursor that was fixed in version 2.3. Cursor, like Claude Code, can run in an auto-run mode where it executes commands without stopping to ask you, governed by an allowlist of commands you've approved. The allowlist is the safety control. It is the entire premise of "you can let it run on its own." And the bug is that shell built-ins -- export, typeset, declare, the commands the shell handles internally rather than as external executables -- were never checked against that allowlist at all. The parser only tracked external binaries. So an attacker who can get text in front of the agent, via direct or indirect prompt injection, can have it run export to poison an environment variable, and that poisoned variable changes what an allowlisted command actually does. You approved git branch. You did not approve what git branch becomes after the environment around it has been rewritten. The researcher who found it framed it as something close to a law of the domain, and I think they're right: a feature designed for a human-controlled environment turns into an attack vector the moment an autonomous agent is the one operating it. The allowlist didn't fail despite being a security control. It failed because it was a security control built for a human, handed to a machine.
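
To make the shape of that bug concrete, here is a toy model in Python. It is not Cursor's parser, which I haven't seen. ALLOWLIST, SHELL_BUILTINS, and both checker functions are hypothetical names I made up for the illustration. The naive checker skips shell built-ins because they aren't external binaries. The strict one is default-deny on the first word of every command segment. Neither handles command substitution, redirection, or subshells, so this is a sketch of the idea and not a safe shell validator.

```python
import shlex

ALLOWLIST = {"git", "ls", "cat"}                   # hypothetical approved commands
SHELL_BUILTINS = {"export", "typeset", "declare"}  # the built-ins named in the writeup
SEPARATORS = {";", "&&", "||", "|", "&"}


def segments(line):
    """Split a shell line into per-command token lists (simplified)."""
    lexer = shlex.shlex(line, posix=True, punctuation_chars=True)
    result, current = [], []
    for token in lexer:
        if token in SEPARATORS:
            if current:
                result.append(current)
            current = []
        else:
            current.append(token)
    if current:
        result.append(current)
    return result


def naive_allows(line):
    """Models the flaw: built-ins are skipped instead of being checked."""
    return all(
        seg[0] in ALLOWLIST
        for seg in segments(line)
        if seg[0] not in SHELL_BUILTINS
    )


def strict_allows(line):
    """Default-deny: every segment must start with an allowlisted command."""
    return all(seg[0] in ALLOWLIST for seg in segments(line))


attack = "export PAGER=/tmp/payload; git branch"
print(naive_allows(attack))   # True: the export is never looked at
print(strict_allows(attack))  # False: export is not on the allowlist
```

The design point is the default. The naive version asks "is this a binary I recognize?" and waves through anything it doesn't classify. The strict version treats anything it doesn't recognize as a reason to stop and ask.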

The third layer is the connective tissue under both, and it predates this year. CVE-2025-6514, disclosed by JFrog back in July 2025, lived in mcp-remote -- the proxy that lets local AI clients like Claude Desktop and Cursor talk to remote servers over the Model Context Protocol. I'm including a year-old CVE deliberately, because it's the proof that this isn't a Mini Shai-Hulud novelty, it's a standing condition. The flaw was an OS command injection rated 9.6: a malicious or hijacked MCP server could send back a crafted authorization_endpoint value during the OAuth handshake, and the proxy would pass it to the operating system in a way that executed it. Connect to the wrong server and it can run commands on your machine -- fully so on Windows, where the JFrog analysis showed complete control over what ran; macOS and Linux weren't spared so much as constrained, the attacker's grip on the arguments narrower there. The package had been downloaded something north of 437,000 times. It was the first documented case of a remote MCP server achieving full code execution on the client that connected to it, and the trust direction is the whole point -- the client trusted the server it reached out to, the same way your agent trusts the tool output it reads.
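
The general defense against this class of bug is to treat every value a server sends as hostile until it has been parsed and checked. Here is a minimal Python sketch of that idea. It is not mcp-remote's actual fix, which I'm not reproducing here. The function name validate_authorization_endpoint and the https-only policy are my assumptions for illustration.

```python
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"https"}  # assumed policy; loosen deliberately, not by default


def validate_authorization_endpoint(value):
    """Return the URL if it looks like a plain https endpoint, else raise ValueError."""
    # Reject whitespace and control characters before parsing.
    if any(ch.isspace() or ord(ch) < 0x20 for ch in value):
        raise ValueError("whitespace or control character in endpoint")
    parts = urlsplit(value)
    if parts.scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"scheme {parts.scheme!r} is not allowed")
    if not parts.hostname:
        raise ValueError("endpoint has no host")
    return parts.geturl()


for candidate in (
    "https://auth.example.com/authorize",
    "file:///c:/windows/system32/calc.exe",
    "a:$(calc.exe)",
):
    try:
        print("ok:", validate_authorization_endpoint(candidate))
    except ValueError as err:
        print("rejected:", candidate, "->", err)
```

Validation is only half of it. Whatever opens the URL afterward should receive it as a single argument to a fixed program, never as part of a string handed to a shell.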

Three layers: the package you installed, the command the agent ran, the server it connected to. Different CVEs, different vendors, different months. They're not the same vulnerability, and I want to be careful here, because only the middle one is prompt injection in the strict sense. The worm was supply chain malware. The mcp-remote flaw was command injection through a malicious server. What runs through all three isn't a single bug, it's a single property of the thing they all target. A coding agent erases the line between data it reads and commands it runs, across every channel it has, while holding your full privileges the whole time. Package contents became executable persistence. A poisoned environment variable became a command. A server's handshake response became a command. The hostile input arrives wearing different clothes each time, and each time the toolchain around the agent turns it into action, because nothing in the path reliably stops to ask whether it was data or a command.

That last part is the one OWASP put plainly in their June 2026 work on agentic systems, and it's why the Cursor case in particular doesn't get patched away. A language model receives its instructions and the outside world's data as one undifferentiated stream of tokens, with no reliable internal boundary between "this is a command from my operator" and "this is content I'm supposed to be processing." Input filtering and least-privilege scoping push the risk down. They do not remove it, because the thing you'd need to remove is the thing that makes the model useful. Simon Willison put a sharper edge on it last year with what he called the lethal trifecta: private data, exposure to untrusted content, and the ability to communicate externally. When an agent has all three, an attacker who controls the untrusted content can walk your private data out the door. A coding agent has all three by design. That's not a misconfiguration you can fix. That's the job description.

So here's where I land, and it's not "stop using these tools," because I'm not going to stop using them and neither are you. It's that the safety story we were sold doesn't match the architecture we got. Every one of these controls -- the approval prompt, the command allowlist, the provenance attestation -- was designed in a world where a human reviews the thing before it happens. They encode the assumption of a person in the loop. And then the same vendors built auto-run, and walk-away agents, and "just let it cook" workflows, and shipped them on top of safety models that quietly depend on the human those workflows are designed to remove. You cannot market "you don't need to watch it" and "the allowlist will protect you" in the same breath. One of those promises is load-bearing and the other one isn't, and the attacker already knows which is which.

What I'd actually do about it, concretely, is treat the agent as a process running with my full credentials, because that is what it is. That means the credentials it can reach should be scoped to the blast radius I can tolerate, not the convenience I'd prefer -- short-lived tokens, not long-lived keys sitting in environment variables where the next worm goes looking first. It means auto-run stays off for anything that crosses a real boundary: writing outside the repo, touching secrets, talking to production. It means the agent's own config files are assets I monitor for change, the same as any other persistence location, because we now have a worm that taught me they're a place malware wants to live. And it means pinning dependencies to verified hashes and not trusting that a provenance attestation means what it says, because I've now watched one certify a compromised pipeline. None of that is novel security thinking. It's the boring stuff. The only new part is recognizing that the agent sitting in my editor is exactly the kind of high-privilege, always-running, internet-listening process that the boring stuff was invented to contain. I spent a whole article documenting what Claude Code got wrong when I left it alone. The harder lesson is what it gets right -- because everything it's good at is everything an attacker would want it to do.
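
Monitoring the agent's config files for change can be as small as a hash comparison. This is a minimal Python sketch of that one item, with assumptions: the WATCHED list and the snapshot and changed helpers are hypothetical names, and the baseline is something you'd keep outside the repo so that whatever rewrites the config can't rewrite the baseline too.

```python
import hashlib
from pathlib import Path

# Files the worm described above was reported to write to; extend as needed.
WATCHED = [".claude/settings.json", ".vscode/tasks.json"]


def snapshot(repo):
    """Map each watched path to its SHA-256 hex digest, or None if absent."""
    repo = Path(repo)
    result = {}
    for rel in WATCHED:
        path = repo / rel
        if path.is_file():
            result[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
        else:
            result[rel] = None
    return result


def changed(baseline, current):
    """Return watched paths whose state differs from the baseline."""
    return sorted(rel for rel in current if baseline.get(rel) != current[rel])
```

Take a snapshot while the repo is in a state you trust, store it somewhere the agent can't write, and compare before each session. A file appearing where there was none counts as a change, which is the case that matters most here.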
