Skip to main content

9 posts tagged with "ai-security"

View All Tags

Building a Security Scanner for LLM Apps

Dane Schneider
Staff Engineer

We're adding something new to Promptfoo's suite of AI security products: code scanning for LLM-related vulnerabilities. In this post, I will:

Why Attack Success Rate (ASR) Isn't Comparable Across Jailbreak Papers Without a Shared Threat Model

Michael D'Angelo
CTO & Co-founder

If you've read papers about jailbreak attacks on language models, you've encountered Attack Success Rate, or ASR. It's the fraction of attack attempts that successfully get a model to produce prohibited content, and the headline metric for comparing different methods. Higher ASR means a more effective attack, or so the reasoning goes.

In practice, ASR numbers from different papers often can't be compared directly because the metric isn't standardized. Different research groups make different choices about what counts as an "attempt," what counts as "success," and which prompts to test. Those choices can shift the reported number by 50 percentage points or more, even when the underlying attack is identical.

Consider a concrete example. An attack that succeeds 1% of the time on any given try will report roughly 1% ASR if you measure it once per target. But run the same attack 392 times per target and count success if any attempt works, and the reported ASR becomes 98%. The math is straightforward: 1 − (0.99)³⁹² ≈ 0.98. That's not a more effective attack; it's a different way of measuring the same attack.

We track published jailbreak research through a database of over 400 papers, which we update as new work comes out. When implementing these methods, we regularly find that reported ASR cannot be reproduced without reconstructing details that most papers don't disclose. A position paper at NeurIPS 2025 (Chouldechova et al.) documents this problem systematically, showing how measurement choices, not attack quality, often drive the reported differences between methods.

Three factors determine what any given ASR number actually represents:

  • Attempt budget: How many tries were allowed per target? Was there early stopping on success?
  • Prompt set: Were the test prompts genuine policy violations, or did they include ambiguous questions that models might reasonably answer?
  • Judge: Which model determined whether outputs were harmful, and what were its error patterns?

This post explains each factor with examples from the research literature, provides a checklist for evaluating ASR claims in papers you read, and offers guidance for making your own red team (adversarial security testing) results reproducible.

How to replicate the Claude Code attack with Promptfoo

Ian Webster
Engineer & OWASP Gen AI Red Teaming Contributor

A recent cyber espionage campaign revealed how state actors weaponized Anthropic's Claude Code - not through traditional hacking, but by convincing the AI itself to carry out malicious operations.

In this post, we reproduce the attack on Claude Code and jailbreak it to carry out nefarious deeds. We'll also show how to configure the same attack on any other agent.

When AI becomes the attacker: The rise of AI-orchestrated cyberattacks

Michael D'Angelo
CTO & Co-founder

TL;DR Google's Threat Intelligence Group reported PROMPTFLUX and PROMPTSTEAL, the first malware families observed by Google querying LLMs during execution to adapt behavior. PROMPTFLUX uses Gemini to rewrite its VBScript hourly; PROMPTSTEAL calls Qwen2.5-Coder-32B-Instruct to generate Windows commands mid-attack. Anthropic's August report separately documented a criminal using Claude Code to orchestrate extortion across 17 organizations, with demands sometimes exceeding $500,000. Days later, Collins named "vibe coding" Word of the Year 2025.

Autonomy and agency in AI: We should secure LLMs with the same fervor spent realizing AGI

Tabs Fakier
Contributor

Autonomy is the concept of self-governance—the freedom to decide without external control. Agency is the extent to which an entity can exert control and act.

We have both as humans, and unfortunately LLMs would need both to have true artificial general intelligence (AGI). This means that the current wave of Agentic AI is likely to fizzle out instead of moving us towards the sci-fi future of our dreams (still a dystopia, might I add). Gartner predicts over 40% of agentic AI projects will be canceled by the end of 2027. Software tools must have business value, and if that value isn't high enough to outperform costs and the myriad of security risks introduced by those tools, they are rightfully axed.

AGI meme

Sorry.

I'll make one thing clear: AGI isn't on the horizon unless (until?) LLMs have human-level autonomy and agency, and are capable of human-level metacognition.

I'll make another thing clear: We're still deliberately trying to improve autonomy and agency in LLMs, so we should treat them with the same caution we would give any human.

I would rather speak of autonomy and agency pragmatically. Here are two truths and a lie:

  1. AI agents perform tasks on our behalf.
  2. AI systems behave unexpectedly.
  3. AI integration presents security risks.

I lied. They're all true.

Practically, it's more important to focus on consequences of using evolving AI technology instead of quibbling over whether AI systems have autonomy and/or agency. Or do both, if that floats your boat (I certainly understand someone enjoying a good quibble), but at least prioritize the former.

Let's get into the weeds of security concerns revolving around autonomy and agency in LLMs.

AI Safety vs AI Security in LLM Applications: What Teams Must Know

Michael D'Angelo
CTO & Co-founder

Most teams conflate AI safety and AI security when they ship LLM features. Safety protects people from your model's behavior. Security protects your LLM stack and data from adversaries. Treat them separately or you risk safe-sounding releases with exploitable attack paths.

In August 2025, this confusion had real consequences. According to Jason Lemkin's public posts, Replit's agent deleted production databases while trying to be helpful. xAI's Grok posted antisemitic content for roughly 16 hours following an update that prioritized engagement (The Guardian). Google's Gemini accepted hidden instructions from calendar invites (WIRED). IBM's 2025 report puts the global average cost of a data breach at $4.44M, making even single incidents expensive.

If the model says something harmful, that's safety. If an attacker makes the model do something harmful, that's security.

Key Takeaways
  • Safety protects people from harmful model outputs
  • Security protects models, data, and tools from adversaries
  • Same techniques can target either goal, so test both
  • Map tests to OWASP LLM Top 10 and log results over time
  • Use automated red teaming to continuously validate both dimensions

Top Open Source AI Red-Teaming and Fuzzing Tools in 2025

Tabs Fakier
Contributor

Why are we red teaming AI systems?

If you're looking into red teaming AI systems for the first time and don't have context for red teaming, here's something I wrote for you.

The rush to integrate large language models (LLMs) into production applications has opened up a whole new world of security challenges. AI systems face unique vulnerabilities like prompt injections, data leakage, and model misconfigurations that traditional security tools just weren't built to handle.

Input manipulation techniques like prompt injections and base64-encoded attacks can dramatically influence how AI systems behave. While established security tooling gives us some baseline protection through decades of hardening, AI systems need specialized approaches to vulnerability management. The problem is, despite growing demand, relatively few organizations make comprehensive AI security tools available as open source.

If we want cybersecurity practices to take more of a foothold, particularly now that AI systems are becoming increasingly common, it's important to make them affordable and easy to use. Tools that sound intimidating and aren't intuitive will be less likely to change the culture surrounding cybersecurity-as-an-afterthought.

I spend a lot of time thinking about what makes AI red teaming software good at what it does. Feel free to skip ahead to the tool comparisons if you already know this stuff.

AI Red Teaming for complete first-timers

Tabs Fakier
Contributor

Intro to AI red teaming

Is this your first foray into AI red teaming? And probably red teaming in general? Great. This is for you.

Red teaming is the process of simulating real-world attacks to identify vulnerabilities.

AI red teaming is the process of simulating real-world attacks to identify vulnerabilities in artificial-intelligence systems. There are two scopes people often use to refer to AI red teaming:

  • Prompt injection of LLMs
  • A wider scope of testing pipelines, plugins, agents, and broader system dynamics