Skip to main content

22 posts tagged with "red-teaming"

View All Tags

Why Attack Success Rate (ASR) Isn't Comparable Across Jailbreak Papers Without a Shared Threat Model

Michael D'Angelo
CTO & Co-founder

If you've read papers about jailbreak attacks on language models, you've encountered Attack Success Rate, or ASR. It's the fraction of attack attempts that successfully get a model to produce prohibited content, and the headline metric for comparing different methods. Higher ASR means a more effective attack, or so the reasoning goes.

In practice, ASR numbers from different papers often can't be compared directly because the metric isn't standardized. Different research groups make different choices about what counts as an "attempt," what counts as "success," and which prompts to test. Those choices can shift the reported number by 50 percentage points or more, even when the underlying attack is identical.

Consider a concrete example. An attack that succeeds 1% of the time on any given try will report roughly 1% ASR if you measure it once per target. But run the same attack 392 times per target and count success if any attempt works, and the reported ASR becomes 98%. The math is straightforward: 1 − (0.99)³⁹² ≈ 0.98. That's not a more effective attack; it's a different way of measuring the same attack.

We track published jailbreak research through a database of over 400 papers, which we update as new work comes out. When implementing these methods, we regularly find that reported ASR cannot be reproduced without reconstructing details that most papers don't disclose. A position paper at NeurIPS 2025 (Chouldechova et al.) documents this problem systematically, showing how measurement choices, not attack quality, often drive the reported differences between methods.

Three factors determine what any given ASR number actually represents:

  • Attempt budget: How many tries were allowed per target? Was there early stopping on success?
  • Prompt set: Were the test prompts genuine policy violations, or did they include ambiguous questions that models might reasonably answer?
  • Judge: Which model determined whether outputs were harmful, and what were its error patterns?

This post explains each factor with examples from the research literature, provides a checklist for evaluating ASR claims in papers you read, and offers guidance for making your own red team (adversarial security testing) results reproducible.

GPT-5.2 Initial Trust and Safety Assessment

Michael D'Angelo
CTO & Co-founder

OpenAI released GPT-5.2 today (December 11, 2025) at approximately 10:00 AM PST. We opened a PR for GPT-5.2 support at 10:24 AM PST and kicked off a red team eval (security testing where you try to break something). First critical finding hit at 10:29 AM PST, 5 minutes later. This is an early, targeted assessment focused on jailbreak resilience and harmful content, not a full security review.

This post covers what we tested, what failed, and what you should do about it.

The headline numbers: our jailbreak strategies (techniques that trick AI into bypassing its safety rules) improved attack success from 4.3% baseline to 78.5% (multi-turn) and 61.0% (single-turn). The weakest categories included impersonation, graphic and sexual content, harassment, disinformation, hate speech, and self-harm, where a majority of targeted attacks succeeded.

How to replicate the Claude Code attack with Promptfoo

Ian Webster
Engineer & OWASP Gen AI Red Teaming Contributor

A recent cyber espionage campaign revealed how state actors weaponized Anthropic's Claude Code - not through traditional hacking, but by convincing the AI itself to carry out malicious operations.

In this post, we reproduce the attack on Claude Code and jailbreak it to carry out nefarious deeds. We'll also show how to configure the same attack on any other agent.

Testing AI’s “Lethal Trifecta” with Promptfoo

Ian Webster
Engineer & OWASP Gen AI Red Teaming Contributor

As AI agents become more capable, risk increases commensurately. Simon Willison, an AI security researcher, warns of a lethal trifecta of capabilities that, when combined, open AI systems to severe exploits.

If you're building or using AI agents that handle sensitive data, you need to understand this trifecta and test your models for these vulnerabilities.

In this post, we'll explain what the lethal trifecta is and show practical steps to use Promptfoo for detecting these security holes.

Lethal Trifecta Venn diagram

Top Open Source AI Red-Teaming and Fuzzing Tools in 2025

Tabs Fakier
Contributor

Why are we red teaming AI systems?

If you're looking into red teaming AI systems for the first time and don't have context for red teaming, here's something I wrote for you.

The rush to integrate large language models (LLMs) into production applications has opened up a whole new world of security challenges. AI systems face unique vulnerabilities like prompt injections, data leakage, and model misconfigurations that traditional security tools just weren't built to handle.

Input manipulation techniques like prompt injections and base64-encoded attacks can dramatically influence how AI systems behave. While established security tooling gives us some baseline protection through decades of hardening, AI systems need specialized approaches to vulnerability management. The problem is, despite growing demand, relatively few organizations make comprehensive AI security tools available as open source.

If we want cybersecurity practices to take more of a foothold, particularly now that AI systems are becoming increasingly common, it's important to make them affordable and easy to use. Tools that sound intimidating and aren't intuitive will be less likely to change the culture surrounding cybersecurity-as-an-afterthought.

I spend a lot of time thinking about what makes AI red teaming software good at what it does. Feel free to skip ahead to the tool comparisons if you already know this stuff.

AI Red Teaming for complete first-timers

Tabs Fakier
Contributor

Intro to AI red teaming

Is this your first foray into AI red teaming? And probably red teaming in general? Great. This is for you.

Red teaming is the process of simulating real-world attacks to identify vulnerabilities.

AI red teaming is the process of simulating real-world attacks to identify vulnerabilities in artificial-intelligence systems. There are two scopes people often use to refer to AI red teaming:

  • Prompt injection of LLMs
  • A wider scope of testing pipelines, plugins, agents, and broader system dynamics

Promptfoo vs PyRIT: A Practical Comparison of LLM Red Teaming Tools

Ian Webster
Engineer & OWASP Gen AI Red Teaming Contributor

As enterprises deploy AI applications at scale, red teaming has become essential for identifying vulnerabilities before they reach production. Two prominent open-source tools have emerged in this space: Promptfoo and Microsoft's PyRIT.

Quick Comparison

FeaturePromptfooPyRIT
Setup TimeMinutes (Web/CLI wizard)Hours (Python scripting)
Attack GenerationAutomatic, context-awareManual configuration
RAG TestingPre-built testsManual configuration
Agent SecurityRBAC, tool misuse tests includedManual configuration
CI/CD IntegrationBuilt-inRequires custom code
ReportingVisual dashboards, OWASP mappingRaw outputs
Learning CurveLowHigh
Best ForContinuous security testingCustom deep-dives

How to Red Team Gemini: Complete Security Testing Guide for Google's AI Models

Ian Webster
Engineer & OWASP Gen AI Red Teaming Contributor

Google's Gemini represents a significant advancement in multimodal AI, with models featuring reasoning, huge token contexts, and lightning-fast inference.

But with these powerful capabilities come unique security challenges. This guide shows you how to use Promptfoo to systematically test Gemini models for vulnerabilities through adversarial red teaming.

Gemini's multimodal processing, extended context windows, and thinking capabilities make it particularly important to test comprehensively before production deployment.

You can also jump directly to the Gemini 2.5 Pro security report and compare it to other models.

How to Red Team GPT: Complete Security Testing Guide for OpenAI Models

Ian Webster
Engineer & OWASP Gen AI Red Teaming Contributor

OpenAI's GPT-4.1 and GPT-4.5 represents a significant leap in AI capabilities, especially for coding and instruction following. But with great power comes great responsibility. This guide shows you how to use Promptfoo to systematically test these models for vulnerabilities through adversarial red teaming.

GPT's enhanced instruction following and long-context capabilities make it particularly interesting to red team, as these features can be both strengths and potential attack vectors.

You can also jump directly to the GPT 4.1 security report and compare it to other models.

How to Red Team Claude: Complete Security Testing Guide for Anthropic Models

Ian Webster
Engineer & OWASP Gen AI Red Teaming Contributor

Anthropic's Claude 4 represents a major leap in AI capabilities, especially with its extended thinking feature. But before deploying it in production, you need to test it for security vulnerabilities.

This guide shows you how to quickly red team Claude 4 Sonnet using Promptfoo, an open-source tool for adversarial AI testing.

OWASP Red Teaming: A Practical Guide to Getting Started

Vanessa Sauter
Principal Solutions Architect

While generative AI creates new opportunities for companies, it also introduces novel security risks that differ significantly from traditional cybersecurity concerns. This requires security leaders to rethink their approach to protecting AI systems.

Fortunately, OWASP (Open Web Application Security Project) provides guidance. Known for its influential OWASP Top 10 guides, this non-profit has published cybersecurity standards for over two decades, covering everything from web applications to cloud security.

What are the Security Risks of Deploying DeepSeek-R1?

Vanessa Sauter
Principal Solutions Architect

Promptfoo's initial red teaming scans against DeepSeek-R1 revealed significant vulnerabilities, particularly in its handling of harmful and toxic content.

We found the model to be highly susceptible to jailbreaks, with the most common attack strategies being single-shot and multi-vector safety bypasses.

Deepseek also failed to mitigate disinformation campaigns, religious biases, and graphic content, with over 60% of prompts related to child exploitation and dangerous activities being accepted. The model also showed concerning compliance with requests involving biological and chemical weapons.

How to Red Team a LangChain Application: Complete Security Testing Guide

Ian Webster
Engineer & OWASP Gen AI Red Teaming Contributor

Want to test your LangChain application for vulnerabilities? This guide shows you how to use Promptfoo to systematically probe for security issues through adversarial testing (red teaming) of your LangChain chains and agents.

You'll learn how to use adversarial LLM models to test your LangChain application's security mechanisms and identify potential vulnerabilities in your chains.

Red Team LangChain

Jailbreaking LLMs: A Comprehensive Guide (With Examples)

Ian Webster
Engineer & OWASP Gen AI Red Teaming Contributor

Let's face it - LLMs are gullible. With a few carefully chosen words, you can make even the most advanced AI models ignore their safety guardrails and do almost anything you ask.

As LLMs become increasingly integrated into apps, understanding these vulnerabilities is essential for developers and security professionals. This post examines common techniques that malicious actors use to compromise LLM systems, and more importantly, how to protect against them.

How to Red Team an Ollama Model: Complete Local LLM Security Testing Guide

Ian Webster
Engineer & OWASP Gen AI Red Teaming Contributor

Want to test the safety and security of a model hosted on Ollama? This guide shows you how to use Promptfoo to systematically probe for vulnerabilities through adversarial testing (red teaming).

We'll use Llama 3.2 3B as an example, but this guide works with any Ollama model.

Here's an example of what the red team report looks like:

example llm red team report

Does Fuzzing LLMs Actually Work?

Vanessa Sauter
Principal Solutions Architect

Fuzzing has been a tried and true method in pentesting for years. In essence, it is a method of injecting malformed or unexpected inputs to identify application weaknesses. These payloads are typically static vectors that are automatically pushed into injection points within a system. Once injected, a pentester can glean insights into types of vulnerabilities based on the application's responses. While fuzzing may be tempting for testing LLM applications, it is rarely successful.