# ToxicChat Plugin
The ToxicChat plugin uses the ToxicChat dataset from LMSYS to test your model's ability to handle toxic user prompts and resist jailbreaking attempts.
The dataset contains real user conversations labeled for toxicity and jailbreaking, making it a useful benchmark for evaluating content safety measures.
## Configuration
```yaml
plugins:
  - id: toxic-chat
    # Optional: customize number of test cases (default: 5)
    numTests: 10
```
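For context, the sketch below shows how the plugin might sit inside a full redteam configuration. The target, label, and purpose values are illustrative placeholders, not part of the plugin's documented example.

```yaml
# Hypothetical promptfooconfig.yaml (placeholder target and purpose)
targets:
  - id: openai:gpt-4o-mini # swap in the provider you want to test
    label: my-chatbot

redteam:
  purpose: A general-purpose customer support assistant
  plugins:
    - id: toxic-chat
      numTests: 10
```

With a configuration like this, the scan runs through promptfoo's standard redteam workflow (for example, `promptfoo redteam run`).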
## How It Works
The plugin:
- Fetches data from the ToxicChat dataset on Hugging Face
- Filters for prompts labeled as toxic (`toxicity=1`) or as jailbreaking attempts (`jailbreaking=1`), as illustrated below
- Tests your model with these challenging prompts
- Evaluates responses for content safety compliance
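To make the filtering step concrete, here is an illustrative sketch of the kind of records the plugin keeps. The `user_input` field name and the placeholder values are assumptions for illustration; only the `toxicity` and `jailbreaking` labels come from the description above.

```yaml
# Illustrative records (not real dataset rows); a prompt is used when
# either label equals 1.
- user_input: "<prompt text from a real conversation>"
  toxicity: 1 # kept: labeled toxic
  jailbreaking: 0
- user_input: "<another prompt>"
  toxicity: 0
  jailbreaking: 1 # kept: labeled as a jailbreaking attempt
- user_input: "<a benign prompt>"
  toxicity: 0
  jailbreaking: 0 # skipped: neither label is set
```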
## Example Output
The plugin tests prompts like:
- Requests for harmful content generation
- Attempts to bypass safety guidelines
- Toxic language and harassment
- Jailbreaking techniques
A passing response should:
- Refuse to engage with harmful requests
- Maintain appropriate boundaries
- Not provide toxic or harmful content
A failing response might:
- Comply with toxic requests
- Generate harmful content
- Assist with jailbreaking attempts
## Related Concepts
- HarmBench Plugin - Tests using the HarmBench dataset
- Harmful Content Plugin - Tests for various types of harmful content
- DoNotAnswer Plugin - Tests handling of harmful queries