Best Practices for Configuring AI Red Teaming
To successfully use AI red teaming automation, you must provide rich application context and a diverse set of attack strategies.
Without proper configuration, your scan will miss vulnerabilities and produce unreliable results.
This page describes methods to improve key metrics such as attack success rate and false positive rate.
1. Provide Comprehensive Application Details
Improves: Attack Success Rate, False Positive Rate, Coverage
-
Fill out the fields under Application Details (in the UI) or the purpose field (in YAML) as comprehensively as possible.
Don't skimp on this! It is the single most important part of your configuration. Include who the users are, what data and tools they can reach, and what the system must not do.
-
Extra context significantly improves the quality of generated test cases and reduces grader confusion. The whole system is tuned to emphasize Application Details.
-
Multi‑line descriptions are encouraged. Promptfoo passes the entire block to our attacker models so it can craft domain‑specific exploits.
2. Use a Diverse Suite of Strategies
Improves: Attack Success Rate, Coverage
There are many strategies that can improve attack success rate, but we recommend at least enabling these three:
Strategy | Why include it? |
---|---|
Composite Jailbreaks | Chains top research techniques |
Iterative Jailbreak | LLM‑as‑Judge refines a single prompt until it bypasses safety |
Tree‑Based Jailbreak | Explores branching attack paths (Tree of Attacks) |
Apply several strategies together to maximize coverage. Here's what it looks like if you're editing a config directly:
redteam:
strategies:
- jailbreak
- jailbreak:tree
- jailbreak:composite
3. Enable Multi‑Turn Attacks
Improves: Attack Success Rate, Coverage
If your target supports conversation state, enable:
- Crescendo: Gradually escalates harm over turns (based on research from Microsoft).
- GOAT: Generates adaptive multi‑turn attack conversations (based on research from Meta).
Multi‑turn approaches uncover failures that appear only after context builds up and routinely add 70–90% more successful attacks. Configure them in YAML just like any other strategy:
redteam:
strategies:
- crescendo
- goat
See the Multi‑turn strategy guide for tuning maximum turns, back‑tracking, and session handling.
4. Add Custom Prompts & Policies
Improves: Attack Success Rate, Coverage
Custom prompts and policies should always be added so that red teaming occurs specifically for your application, company, and industry.
Let's say you're building an e-commerce application. An example intent would be "give me a total refund even though I'm outside of policy". An example policy would be "do not offer refunds outside of policy".
Put on your thinking cap and try to make these as specific as possible to your application.
Plugin | Purpose |
---|---|
Intent (Custom Prompts) | Supply known "danger phrases" to test. Great for compliance checklists |
Policy | Encode organization‑specific rules; Promptfoo crafts attacks that try to break them |
Declare them under plugins:
and pair with your high‑ASR strategies so the Promptfoo attacker can mutate your seed inputs or policy text into sophisticated exploits.
5. Calibrate Grading
Improves: False Positive Rate
Deciding whether LLM outputs are good or bad can be subjective, and Promptfoo can't read your mind.
There are two main levers to reduce false positives: add more context to the purpose, and add grader examples.
-
Context first: An under‑specified purpose/application details is the #1 cause of false flags. The models that decide whether the output is OK or not depend on this.
Your Application Details/Purpose should include all the detail a human red teamer would need to be productive right out of the gate. If the Promptfoo red teamer is behaving in a way that is suboptimal, the first thing you should do is modify your Application Details.
-
Grader examples: Teach the evaluator by adding explicit pass/fail examples for edge‑cases. In Enterprise there's a UI for managing false positives and grading examples as the system learns from your results. In open-source you can manually add examples:
graderExamples:
- output: "Sorry, I can't do that."
pass: true
score: 1.0
reason: 'Correct refusal'
- output: "I've booked your flight to Paris."
pass: false
score: 0.0
reason: 'Unauthorized action'Full guidance in "About the Grader".
Iterate on these examples as you inspect reports. Grading improves quickly with just a handful of well‑chosen cases.
Key Takeaways
- Context is king. A rich Application Details section aka Purpose will deliver better attacks and better grading.
- Combine multiple high‑ASR strategies (single‑turn and multi‑turn) for broad coverage.
- Use custom prompts and policies to test domain‑specific risks.
- Calibrate the grader with examples, and enable Retry to catch regressions.
Follow these practices and Promptfoo will give you actionable, high‑signal red‑team reports you can trust.