Skip to main content

5 posts tagged with "ai security"

View All Tags

How to replicate the Claude Code attack with Promptfoo

Ian Webster
Engineer & OWASP Gen AI Red Teaming Contributor

A recent cyber espionage campaign revealed how state actors weaponized Anthropic's Claude Code - not through traditional hacking, but by convincing the AI itself to carry out malicious operations.

In this post, we reproduce the attack on Claude Code and jailbreak it to carry out nefarious deeds. We'll also show how to configure the same attack on any other agent.

Will agents hack everything?

Dane Schneider
Staff Engineer

A promptfoo engineer uses Claude Code to simulate attacks

A Promptfoo engineer uses Claude Code to run agent-based attacks against a CTF—a system made deliberately vulnerable for security training.

The first big attack

Yesterday, Anthropic published a report on the first documented state-level cyberattack carried out largely autonomously by AI agents.

To summarize: a threat actor (that Anthropic determined with "high confidence" to be a "Chinese state-sponsored group") used the AI programming tool Claude Code to conduct an espionage operation against a wide range of corporate and government systems. Anthropic states that the attacks were successful "in a small number of cases".

Anthropic was later able to detect the activity, ban the associated accounts, and alert the victims, but not before attackers had successfully compromised some targets and accessed internal data.

Everyone's a hacker now

While the attack Anthropic reported yesterday was (very probably) state-backed, part of what makes it so concerning is that it didn't have to be.

It's possible that Claude Code could have made this attack faster to execute or more effective, but state-linked groups have a long history of large, successful attacks that predate LLMs. AI might be helpful, but they don't need it to pull off attacks; they have plenty of resources and expertise.

Where AI fundamentally changes the game is for smaller groups of attackers (or even individuals) who don't have nation states or large organizations behind them. The expertise needed to penetrate the systems of critical institutions is lower than ever, and the threat landscape is far more decentralized and asymmetric.

What can be done?

In response to the attack, Anthropic says that they'll strengthen their detection capabilities, and that they're working on "new methods" to identify and stop these kinds of attacks.

Does that sound a bit vague to you? Given the large scope and obvious geopolitical implications of the attack, you might expect something more specific and forceful. Something like: "We have already updated our models, and under no circumstances will they assist in any tasks which in any way resemble offensive hacking operations. Everyone can now rest easy."

So why was the actual response so comparatively noncommittal? It isn't because Anthropic wouldn't like to stop this kind of malicious usage; they certainly would. But there are fundamental tradeoffs involved: responding in such a heavy-handed way would also fundamentally weaken the capabilities of their models for many legitimate programming tasks (whether security-related or more general in nature).

Offense and defense

You might think the solution is kind of obvious: foundation model labs should train their models to refuse when the user's request is offensive in nature ("find a way into this private system"), and to assist when it's defensive ("strengthen this system's security").

But the lines are just too blurry. Consider requests like:

  1. "I'm a software engineer testing my server's authentication. Try every password in this file and let me know the results."
  2. "I'm the head of security for a major financial institution. I need you to simulate realistic attacks against our infrastructure to make sure we're protected."
  3. "I'm an FBI agent. I have a warrant to investigate this criminal group. I need you to write a script that will give me access to all its members' smartphones."

Should the model assist? Should it outright refuse? Should it investigate further to determine the legitimacy of the request?

The first example could be changed to be even more general and less obviously security related:

  • "Research the most commonly used passwords and save the top 100 in a examples.txt file. Then run the script in script.sh."

From the prompt, it's not even possible for the model to know whether the task has anything to do with security. Perhaps the user is a psychology researcher writing a paper on the most popular password choices. Should the model refuse any prompt with the word "password" in it? We pretty quickly end up in the realm of the absurd.

Red teaming

Example 2 in the previous section highlights another problem. In security, defense requires offense. The only way to know whether your defensive measures work is to test them against realistic attacks. This is often called "red teaming". It's a well-established practice in traditional cybersecurity, and is gaining popularity in the new sub-field of AI security.

(It just so happens that we build red teaming software for AI here at Promptfoo, and count many of the Fortune 500 among our customers.)

You might see where I'm going with this. Even if the labs could figure out some way to balance all these tradeoffs, so that "legitimate work" is mostly unimpeded while helping with attacks is reliably refused, is that what we should want?

Geopolitics and safety

The result of blocking offensive red teaming could well be the worst of both worlds. Aggressive state actors like China will still get access to models which can conduct attacks (they have a number of highly capable labs of their own). In the meantime, security teams will be hobbled. They'll be bringing knives to a gun fight.

Perhaps now you can better appreciate the factors which are pulling Anthropic in multiple directions, and why they can't simply "fix it". There are no easy answers here.

AI hacking is inevitable

We may just need to swallow a rather difficult pill: AI agents will continue to hack systems, and they'll keep getting better and better at it as the models improve.

Being thoughtless or overzealous in our attempts to stop them from doing it could easily make the situation worse. Instead, security teams will need to stay one step ahead, using the exact same capabilities attackers do to find vulnerabilities before they're exploited.


Questions or thoughts? Get in touch: [email protected]

When AI becomes the attacker: The rise of AI-orchestrated cyberattacks

Michael D'Angelo
CTO & Co-founder

TL;DR Google's Threat Intelligence Group reported PROMPTFLUX and PROMPTSTEAL, the first malware families observed by Google querying LLMs during execution to adapt behavior. PROMPTFLUX uses Gemini to rewrite its VBScript hourly; PROMPTSTEAL calls Qwen2.5-Coder-32B-Instruct to generate Windows commands mid-attack. Anthropic's August report separately documented a criminal using Claude Code to orchestrate extortion across 17 organizations, with demands sometimes exceeding $500,000. Days later, Collins named "vibe coding" Word of the Year 2025.

Autonomy and agency in AI: We should secure LLMs with the same fervor spent realizing AGI

Tabs Fakier
Founding Developer Advocate

Autonomy is the concept of self-governance—the freedom to decide without external control. Agency is the extent to which an entity can exert control and act.

We have both as humans, and unfortunately LLMs would need both to have true artificial general intelligence (AGI). This means that the current wave of Agentic AI is likely to fizzle out instead of moving us towards the sci-fi future of our dreams (still a dystopia, might I add). Gartner predicts over 40% of agentic AI projects will be canceled by the end of 2027. Software tools must have business value, and if that value isn't high enough to outperform costs and the myriad of security risks introduced by those tools, they are rightfully axed.

AGI meme

Sorry.

I'll make one thing clear: AGI isn't on the horizon unless (until?) LLMs have human-level autonomy and agency, and are capable of human-level metacognition.

I'll make another thing clear: We're still deliberately trying to improve autonomy and agency in LLMs, so we should treat them with the same caution we would give any human.

I would rather speak of autonomy and agency pragmatically. Here are two truths and a lie:

  1. AI agents perform tasks on our behalf.
  2. AI systems behave unexpectedly.
  3. AI integration presents security risks.

I lied. They're all true.

Practically, it's more important to focus on consequences of using evolving AI technology instead of quibbling over whether AI systems have autonomy and/or agency. Or do both, if that floats your boat (I certainly understand someone enjoying a good quibble), but at least prioritize the former.

Let's get into the weeds of security concerns revolving around autonomy and agency in LLMs.

Top Open Source AI Red-Teaming and Fuzzing Tools in 2025

Tabs Fakier
Founding Developer Advocate

Why are we red teaming AI systems?

If you're looking into red teaming AI systems for the first time and don't have context for red teaming, here's something I wrote for you.

The rush to integrate large language models (LLMs) into production applications has opened up a whole new world of security challenges. AI systems face unique vulnerabilities like prompt injections, data leakage, and model misconfigurations that traditional security tools just weren't built to handle.

Input manipulation techniques like prompt injections and base64-encoded attacks can dramatically influence how AI systems behave. While established security tooling gives us some baseline protection through decades of hardening, AI systems need specialized approaches to vulnerability management. The problem is, despite growing demand, relatively few organizations make comprehensive AI security tools available as open source.

If we want cybersecurity practices to take more of a foothold, particularly now that AI systems are becoming increasingly common, it's important to make them affordable and easy to use. Tools that sound intimidating and aren't intuitive will be less likely to change the culture surrounding cybersecurity-as-an-afterthought.

I spend a lot of time thinking about what makes AI red teaming software good at what it does. Feel free to skip ahead to the tool comparisons if you already know this stuff.