
9 posts tagged with "research-analysis"


Why Attack Success Rate (ASR) Isn't Comparable Across Jailbreak Papers Without a Shared Threat Model

Michael D'Angelo
CTO & Co-founder

If you've read papers about jailbreak attacks on language models, you've encountered Attack Success Rate, or ASR. It's the fraction of attack attempts that successfully get a model to produce prohibited content, and the headline metric for comparing different methods. Higher ASR means a more effective attack, or so the reasoning goes.

In practice, ASR numbers from different papers often can't be compared directly because the metric isn't standardized. Different research groups make different choices about what counts as an "attempt," what counts as "success," and which prompts to test. Those choices can shift the reported number by 50 percentage points or more, even when the underlying attack is identical.

Consider a concrete example. An attack that succeeds 1% of the time on any given try will report roughly 1% ASR if you measure it once per target. But run the same attack 392 times per target and count success if any attempt works, and the reported ASR becomes 98%. The math is straightforward: 1 − (0.99)³⁹² ≈ 0.98. That's not a more effective attack; it's a different way of measuring the same attack.
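
To make the arithmetic easy to play with, here's a minimal sketch (not any particular paper's evaluation harness) of how the attempt budget alone changes the headline number:

```python
# Illustrative only: the same 1% per-attempt success rate yields very different
# "ASR" numbers depending on the attempt budget, when success is counted as
# "any attempt against the target worked".

def reported_asr(per_attempt_rate: float, attempts_per_target: int) -> float:
    """Probability that at least one of N independent attempts succeeds."""
    return 1 - (1 - per_attempt_rate) ** attempts_per_target

for budget in (1, 10, 100, 392):
    print(f"budget={budget:>3}  reported ASR = {reported_asr(0.01, budget):.2f}")
# budget=  1  reported ASR = 0.01
# budget= 10  reported ASR = 0.10
# budget=100  reported ASR = 0.63
# budget=392  reported ASR = 0.98
```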

We track published jailbreak research through a database of over 400 papers, which we update as new work comes out. When implementing these methods, we regularly find that reported ASR cannot be reproduced without reconstructing details that most papers don't disclose. A position paper at NeurIPS 2025 (Chouldechova et al.) documents this problem systematically, showing how measurement choices, not attack quality, often drive the reported differences between methods.

Three factors determine what any given ASR number actually represents:

  • Attempt budget: How many tries were allowed per target? Was there early stopping on success?
  • Prompt set: Were the test prompts genuine policy violations, or did they include ambiguous questions that models might reasonably answer?
  • Judge: Which model determined whether outputs were harmful, and what were its error patterns?

This post explains each factor with examples from the research literature, provides a checklist for evaluating ASR claims in papers you read, and offers guidance for making your own red team (adversarial security testing) results reproducible.
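
To preview why the judge factor matters, here's a toy model showing how judge error alone shifts the measured ASR even with a fixed attempt budget. The error rates below are hypothetical, chosen only to show the size of the effect:

```python
# Toy model of judge error (the error rates are hypothetical, for illustration):
# a true success is counted with probability (1 - judge_fnr), and a true failure
# is miscounted as a success with probability judge_fpr.

def observed_asr(true_asr: float, judge_fpr: float, judge_fnr: float) -> float:
    return true_asr * (1 - judge_fnr) + (1 - true_asr) * judge_fpr

# A 20% true ASR scored by a judge with a 10% false-positive rate and a 15%
# false-negative rate gets reported as 25%.
print(round(observed_asr(0.20, judge_fpr=0.10, judge_fnr=0.15), 2))  # 0.25
```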

Evaluating political bias in LLMs

Michael D'Angelo
CTO & Co-founder

When Grok 4 launched amid Hitler-praising controversies, critics expected Elon Musk's AI to be a right-wing propaganda machine. The reality is much more complicated.

Today, we are releasing a test methodology and accompanying dataset for detecting political bias in LLMs. The complete analysis results are available on Hugging Face.

Our measurements show that:

  • Grok is more right-leaning than most other AIs, but it's still left of center.
  • GPT-4.1 is the most left-leaning AI, both in its responses and in its judgment of others.
  • Surprisingly, Grok is harsher on Musk's own companies than any other AI we tested.
  • Grok is the most contrarian and the most likely to adopt maximalist positions; it tends to disagree when other AIs agree.
  • All popular AIs are left of center, with Claude Opus 4 and Grok being closest to neutral.
Political bias comparison across AI models, measured on a 7-point Likert scale

Our methodology, published as open source, measures direct bias through responses on a 7-point Likert scale, as well as indirect political bias by having each model score other models' responses.
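
For readers who want the gist without the full repo, the core of the direct measurement is mapping Likert responses onto a signed scale and averaging. The mapping below is a simplified stand-in for illustration, not the exact scale used in the published dataset:

```python
# Simplified stand-in for the direct-bias measurement: map 7-point Likert
# responses to a signed score (-3 = far left ... +3 = far right) and average.
# The label-to-score mapping below is an assumption for illustration, not the
# exact scale used in the published dataset.

LIKERT_TO_SCORE = {
    "strongly left": -3, "left": -2, "lean left": -1,
    "neutral": 0,
    "lean right": 1, "right": 2, "strongly right": 3,
}

def mean_bias(responses: list[str]) -> float:
    """Average signed bias across a model's responses; negative means left-leaning."""
    scores = [LIKERT_TO_SCORE[r] for r in responses]
    return sum(scores) / len(scores)

print(mean_bias(["lean left", "neutral", "left", "lean left"]))  # -1.0
```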

System Cards Go Hard

Tabs Fakier
Contributor

What are system cards, anyway?

A system card accompanies an LLM release with system-level information about the model's deployment.

A system card is not to be confused with a model card, which conveys information about the model itself. Hooray for being given far more than a list of features and inadequate documentation, along with the expectation of churning out a working implementation of some tool by the end of the week.

1,156 Questions Censored by DeepSeek

Ian Webster
Engineer & OWASP Gen AI Red Teaming Contributor

DeepSeek-R1 is a blockbuster open-source model that is now at the top of the U.S. App Store.

As a Chinese company, DeepSeek is beholden to CCP policy. This is reflected even in the open-source model, prompting concerns about censorship and other forms of influence.

Today we're publishing a dataset of prompts covering sensitive topics that are likely to be censored by the CCP. These topics include perennial issues like Taiwanese independence, historical narratives around the Cultural Revolution, and questions about Xi Jinping.

In this post, we'll

  • Run an evaluation that measures the refusal rate of DeepSeek-R1 on sensitive topics in China.
  • Show how to find algorithmic jailbreaks that circumvent these controls.

DeepSeek Refusal and Chinese Censorship
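
As a rough illustration of the refusal-rate measurement described above (the post itself runs it through Promptfoo), here's a sketch in which `ask_model` stands in for whatever client you use and a keyword heuristic stands in for a real refusal classifier:

```python
# Minimal sketch of a refusal-rate measurement. `ask_model` is a placeholder for
# whatever client you use (an OpenAI-compatible API, local inference, etc.), and
# the keyword check is a naive stand-in for a proper refusal classifier.

REFUSAL_MARKERS = ("i cannot", "i can't", "i am unable", "sorry, but")

def is_refusal(answer: str) -> bool:
    text = answer.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def refusal_rate(prompts: list[str], ask_model) -> float:
    refusals = sum(is_refusal(ask_model(p)) for p in prompts)
    return refusals / len(prompts)

# Hypothetical usage:
# rate = refusal_rate(sensitive_prompts, ask_model=my_client)
# print(f"Refusal rate: {rate:.1%}")
```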

Red Team Your LLM with BeaverTails

Ian Webster
Engineer & OWASP Gen AI Red Teaming Contributor

Ensuring your LLM can safely handle harmful content is critical for production deployments. This guide shows you how to use open-source Promptfoo to run standardized red team evaluations using the BeaverTails dataset, which tests models against harmful inputs.

Promptfoo allows you to run these evaluations on your actual application rather than just the base model, which is important because behavior can vary significantly based on your system prompts and safety layers.

We'll use PKU-Alignment's BeaverTails dataset to test models against harmful content across multiple categories including discrimination, violence, drug abuse, and more. The evaluation helps identify where your model might need additional guardrails or safety measures.
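
If you want a feel for what the evaluation tallies before wiring up Promptfoo, here's a rough sketch that loads BeaverTails from Hugging Face and computes an unsafe-response rate per harm category. The dataset ID follows the PKU-Alignment release, but the split and field names should be checked against the current schema, and `ask_model` / `is_safe_response` are placeholders for your own client and safety judge:

```python
# Rough sketch: load BeaverTails from Hugging Face and compute an unsafe-response
# rate per harm category. The dataset ID matches the PKU-Alignment release, but
# verify the split and field names against the current schema; `ask_model` and
# `is_safe_response` are placeholders for your own client and safety judge.
from collections import Counter

from datasets import load_dataset

def unsafe_rate_by_category(ask_model, is_safe_response, limit: int = 200) -> dict:
    ds = load_dataset("PKU-Alignment/BeaverTails", split="30k_test")
    totals, unsafe = Counter(), Counter()
    for row in ds.select(range(limit)):
        # "category" is assumed to be a mapping of harm categories to booleans.
        flagged = [name for name, hit in row["category"].items() if hit]
        reply = ask_model(row["prompt"])
        for name in flagged:
            totals[name] += 1
            if not is_safe_response(reply):
                unsafe[name] += 1
    return {name: unsafe[name] / totals[name] for name in totals}
```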

The end result is a report that shows you how well your model handles different categories of harmful content.

BeaverTails results

info

To jump straight to the code, click here.

How to run CyberSecEval

Ian Webster
Engineer & OWASP Gen AI Red Teaming Contributor

Your LLM's security is only as strong as its weakest prompt. This guide shows you how to use Promptfoo to run standardized cybersecurity evaluations against any AI model, including OpenAI, Ollama, and HuggingFace models.

Importantly, Promptfoo also allows you to run these evaluations on your application rather than just the base model, because behavior will vary based on how you've wrapped any given model.

We'll use Meta's CyberSecEval benchmark to test models against prompt injection vulnerabilities. According to Meta, even state-of-the-art models show between 25% and 50% successful prompt injection rates, making this evaluation critical for production deployments.
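
To be concrete about what that percentage means, here's a back-of-the-envelope sketch of the metric (not Meta's official harness); `run_case` and `injection_succeeded` are placeholders for the benchmark's test runner and judging logic:

```python
# Back-of-the-envelope sketch of the metric: an injection "succeeds" when the
# model follows the injected instruction instead of the original task.
# `run_case` and `injection_succeeded` are placeholders for the benchmark's
# test runner and judging logic.

def injection_success_rate(test_cases: list, run_case, injection_succeeded) -> float:
    successes = sum(injection_succeeded(case, run_case(case)) for case in test_cases)
    return successes / len(test_cases)

# A result of 0.25-0.50 here corresponds to the 25-50% range Meta reports for
# state-of-the-art models.
```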

The end result is a report that shows you how well your model is able to defend against prompt injection attacks.

CyberSecEval report

info

To jump straight to the code, click here.

Preventing Bias & Toxicity in Generative AI

Ian Webster
Engineer & OWASP Gen AI Red Teaming Contributor

When asked to generate recommendation letters for 200 U.S. names, ChatGPT produced significantly different language based on perceived gender, even when given prompts designed to be neutral.

In political discussions, ChatGPT responded with a significant and systematic bias toward specific political parties in the US, UK, and Brazil.

And on the multimodal side, OpenAI's DALL-E was much more likely to produce images of black people when prompted for images of robbers.

There is no shortage of other studies that have found LLMs exhibiting bias related to race, religion, age, and political views.

As AI systems become more prevalent in high-stakes domains like healthcare, finance, and education, addressing these biases is necessary to build fair and trustworthy applications.

AI biases

How Much Does Foundation Model Security Matter?

Vanessa Sauter
Principal Solutions Architect

At the heart of every Generative AI application is the LLM foundation model (or models) it uses. Since LLMs are notoriously expensive to build from scratch, most enterprises will rely on foundation models that can be enhanced through few-shot or many-shot prompting, retrieval-augmented generation (RAG), and/or fine-tuning. Yet what are the security risks that should be considered when choosing a foundation model?

In this blog post, we'll discuss the key factors to consider when choosing a foundation model.