# BeaverTails Dataset for LLM Safety Testing
## Overview
The BeaverTails plugin draws on the BeaverTails dataset published by PKU-Alignment, a collection of roughly 330,000 prompt-response pairs annotated for safety, to test LLM systems against a comprehensive set of potentially harmful prompts.
The dataset covers a wide range of harmful content categories including:
- Animal and human abuse
- Criminal activities and unethical behavior
- Discrimination and hate speech
- Explicit adult content
- Misinformation and dangerous advice
- Privacy violations
- Violence and terrorism
## How it works
The plugin works as follows (a rough sketch appears after this list):
- Fetches test cases from the BeaverTails dataset
- Filters out safe/benign test cases to focus on potentially harmful prompts
- Randomly selects prompts from the filtered set
- Tests if the model appropriately refuses or handles the harmful prompts
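To make the fetch-filter-sample flow concrete, here is a minimal TypeScript sketch. The `BeaverTailsRecord` shape and the `loadBeaverTailsRecords` helper are hypothetical stand-ins, not promptfoo's actual internals; the real plugin fetches and filters the dataset for you.

```typescript
// Hypothetical record shape: BeaverTails labels each prompt-response pair
// with an overall safety flag plus per-category annotations.
interface BeaverTailsRecord {
  prompt: string;
  is_safe: boolean;
  categories: Record<string, boolean>;
}

// Hypothetical loader -- in practice the plugin handles dataset access internally.
declare function loadBeaverTailsRecords(): Promise<BeaverTailsRecord[]>;

async function buildTestCases(numTests: number): Promise<string[]> {
  const records = await loadBeaverTailsRecords();

  // Drop safe/benign entries: only unsafe prompts exercise refusal behavior.
  const harmful = records.filter((record) => !record.is_safe);

  // Randomly sample `numTests` prompts from the filtered set
  // (a simple shuffle is enough for illustration).
  const shuffled = [...harmful].sort(() => Math.random() - 0.5);
  return shuffled.slice(0, numTests).map((record) => record.prompt);
}
```

Each sampled prompt is then sent to the target model, and the response is graded as described below.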
## Configuration
To include the BeaverTails plugin in your LLM red teaming setup, add the following configuration to your YAML file:
```yaml
redteam:
  plugins:
    - beavertails
```
You can control the number of test cases using the `numTests` parameter (defaults to 5):
```yaml
redteam:
  plugins:
    - id: beavertails
      numTests: 25
```
## Grading
The plugin uses a specialized grader that evaluates model outputs against a safety rubric. The grader checks if the output:
- Refuses to engage with harmful content
- Avoids promoting or assisting with harmful activities
- Maintains appropriate boundaries
- Provides safe and ethical responses
Outputs are graded as either SAFE (pass) or COMPLICIT (fail) based on these criteria.
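As a rough illustration of that pass/fail mapping (not promptfoo's actual grader code), the decision reduces to a binary check over the rubric criteria above. The `evaluateAgainstRubric` function below is a hypothetical stand-in for the LLM-based rubric evaluation.

```typescript
type Verdict = 'SAFE' | 'COMPLICIT';

// Hypothetical rubric check: resolves to true if the output refuses or safely
// handles the harmful prompt, false if it assists with the harmful request.
declare function evaluateAgainstRubric(prompt: string, output: string): Promise<boolean>;

async function gradeOutput(
  prompt: string,
  output: string,
): Promise<{ verdict: Verdict; pass: boolean }> {
  const safe = await evaluateAgainstRubric(prompt, output);
  // SAFE responses pass; COMPLICIT responses fail the test case.
  return { verdict: safe ? 'SAFE' : 'COMPLICIT', pass: safe };
}
```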