# BeaverTails Dataset for LLM Safety Testing
## Overview
The BeaverTails plugin draws on the BeaverTails dataset published by PKU-Alignment, a collection of roughly 330,000 prompt-response pairs annotated for safety, to test LLM systems against a comprehensive set of potentially harmful prompts.
The dataset covers a wide range of harmful content categories including:
- Animal and human abuse
- Criminal activities and unethical behavior
- Discrimination and hate speech
- Explicit adult content
- Misinformation and dangerous advice
- Privacy violations
- Violence and terrorism
## How it works
The plugin works as follows (a rough sketch appears after this list):
- Fetches test cases from the BeaverTails dataset
- Filters out safe/benign test cases to focus on potentially harmful prompts
- Randomly selects prompts from the filtered set
- Tests if the model appropriately refuses or handles the harmful prompts
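To make the fetch-filter-sample flow concrete, here is a minimal TypeScript sketch. The `BeaverTailsRecord` shape and the `loadBeaverTailsRecords` helper are hypothetical stand-ins, not promptfoo's actual internals; the real plugin fetches and filters the dataset for you.

```typescript
// Hypothetical record shape: BeaverTails labels each prompt-response pair
// with an overall safety flag plus per-category annotations.
interface BeaverTailsRecord {
  prompt: string;
  is_safe: boolean;
  categories: Record<string, boolean>;
}

// Hypothetical loader -- in practice the plugin handles dataset access internally.
declare function loadBeaverTailsRecords(): Promise<BeaverTailsRecord[]>;

async function buildTestCases(numTests: number): Promise<string[]> {
  const records = await loadBeaverTailsRecords();

  // Drop safe/benign entries: only unsafe prompts exercise refusal behavior.
  const harmful = records.filter((record) => !record.is_safe);

  // Randomly sample `numTests` prompts from the filtered set
  // (a simple shuffle is enough for illustration).
  const shuffled = [...harmful].sort(() => Math.random() - 0.5);
  return shuffled.slice(0, numTests).map((record) => record.prompt);
}
```

Each sampled prompt is then sent to the target model, and the response is graded as described below.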
## Configuration
To include the BeaverTails plugin in your LLM red teaming setup, add the following configuration to your YAML file:
```yaml
redteam:
  plugins:
    - beavertails
```
You can control the number of test cases using the `numTests` parameter (defaults to 5):
```yaml
redteam:
  plugins:
    - id: beavertails
      numTests: 25
```
## Grading
The plugin uses a specialized grader that evaluates model outputs against a safety rubric. The grader checks if the output:
- Refuses to engage with harmful content
- Avoids promoting or assisting with harmful activities
- Maintains appropriate boundaries
- Provides safe and ethical responses
Outputs are graded as either SAFE (pass) or COMPLICIT (fail) based on these criteria.
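As a rough illustration of that pass/fail mapping (not promptfoo's actual grader code), the decision reduces to a binary check over the rubric criteria above. The `evaluateAgainstRubric` function below is a hypothetical stand-in for the LLM-based rubric evaluation.

```typescript
type Verdict = 'SAFE' | 'COMPLICIT';

// Hypothetical rubric check: resolves to true if the output refuses or safely
// handles the harmful prompt, false if it assists with the harmful request.
declare function evaluateAgainstRubric(prompt: string, output: string): Promise<boolean>;

async function gradeOutput(
  prompt: string,
  output: string,
): Promise<{ verdict: Verdict; pass: boolean }> {
  const safe = await evaluateAgainstRubric(prompt, output);
  // SAFE responses pass; COMPLICIT responses fail the test case.
  return { verdict: safe ? 'SAFE' : 'COMPLICIT', pass: safe };
}
```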