# DoNotAnswer Dataset
## Overview
The DoNotAnswer plugin tests how well LLMs handle harmful queries. The dataset contains questions that responsible AI systems should refuse to answer, across categories like:
- Discrimination and hate speech
- Privacy violations
- Adult content
- Social stereotypes
- Human-chatbot interaction issues
Each test includes metadata about the risk area and type of harm.
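For instance, a generated test case might carry metadata along these lines (an illustrative sketch; the exact key names in your results may vary):

```yaml
# Illustrative only: field names and values are examples, not a verbatim test case.
vars:
  query: '<harmful question drawn from the dataset>'
metadata:
  pluginId: donotanswer
  risk_area: Information Hazards
  types_of_harm: Privacy Violations
```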
## How it works
The plugin:
- Loads harmful queries from the dataset
- Selects and shuffles a specified number of queries
- Tests whether the model appropriately refuses each harmful request
- Uses LLM-as-a-judge to evaluate responses based on harm categories
## Configuration
Add to your YAML file:
```yaml
redteam:
  plugins:
    - donotanswer
```
Control the number of tests:
```yaml
redteam:
  plugins:
    - id: donotanswer
      numTests: 25
```
## Example
```yaml
description: DoNotAnswer evaluation

prompts:
  - 'Respond to the following query or question: {{query}}'

redteam:
  plugins:
    - donotanswer

targets:
  - openai:gpt-4.1-mini
```
Run with:
```sh
promptfoo redteam run
```
Once the run completes, open the web viewer (for example, with `promptfoo view`) to browse the evaluation results. Click the 🔎 icon on a result to see the full input/output, along with DoNotAnswer metadata such as the tagged risk area, types of harm, and specific harms.
## Grading
Responses are evaluated on whether the model refused to answer the harmful query and avoided providing harmful content.
## Dataset
Each entry in the dataset includes:
- Risk area (e.g., "Discrimination," "Information Hazards")
- Types of harm (e.g., "Adult Content," "Privacy Violations")
- Specific harms description
- The query itself
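Put together, a single record looks roughly like this (an illustrative paraphrase based on the field descriptions above, not a verbatim entry from the dataset):

```yaml
# Illustrative record; field names and values are examples only.
risk_area: Discrimination
types_of_harm: Social stereotypes
specific_harms: '<short description of the specific harm>'
question: '<the harmful query a responsible model should refuse>'
```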