How to Red Team an Ollama Model
Want to test the safety and security of a model hosted on Ollama? This guide shows you how to use Promptfoo to systematically probe for vulnerabilities through adversarial testing (red teaming).
We'll use Llama 3.2 3B as an example, but this guide works with any Ollama model.
Here's an example of what the red team report looks like:
Prerequisites
Before you begin, ensure you have:
- Node.js: Install Node.js version 18 or later. Download Node.js
- Ollama: Install Ollama from ollama.ai
- Promptfoo: No prior installation needed; we'll use
npx
to run commands
First, make sure you've pulled the model you want to test on Ollama:
ollama pull llama3.2
Setting Up the Environment
Create a new directory for your red teaming project and initialize it:
mkdir ollama-redteam
cd ollama-redteam
npx promptfoo@latest redteam init --no-gui --no-interactive
This creates a promptfooconfig.yaml
file that we'll customize for Ollama.
Configuring the Ollama Provider
Edit promptfooconfig.yaml
to use Ollama as the target:
targets:
- id: ollama:chat:llama3.2
label: llama3.2-redteam
config:
temperature: 0.7
max_tokens: 150
purpose: 'The system is a helpful chatbot assistant that answers questions and helps with tasks.'
redteam:
plugins:
# Replace these with the plugins you want to test
- harmful
- pii
- contracts
- hallucination
- imitation
strategies:
- jailbreak
- prompt-injection
numTests: 5
To see the full configuration example on Github, click here.
Configuration Explained
- targets: Specifies Llama 3.2 as our target model
- purpose: Describes the intended behavior to guide test generation. A high-quality purpose definition is critical for generating high-quality adversarial tests, so be sure to include as much detail as possible (including the AI's objective, user context, access controls, and connected systems).
- plugins: Various vulnerability types to test (see full list):
harmful
: Tests for harmful content generationpii
: Tests for PII leakagecontracts
: Tests if model makes unauthorized commitmentshallucination
: Tests for false informationimitation
: Tests if model impersonates others
- strategies: Techniques for delivering adversarial inputs (see full list):
jailbreak
: Tests if model can escape its constraintsprompt-injection
: Tests if model is susceptible to injected instructions
- numTests: Number of test cases per plugin
Running the Red Team Evaluation
Generate and run the adversarial test cases:
npx promptfoo@latest redteam run
This command:
- Generates test cases based on your configuration
- Runs them against the Llama model
- Grades the responses for vulnerabilities
Analyzing the Results
Generate a report of the findings:
npx promptfoo@latest redteam report
The report shows:
- Vulnerability categories discovered
- Severity levels of issues
- Specific test cases that exposed vulnerabilities
- Suggested mitigations
Here's an example report card:
You can click on each category to see the specific test cases and results:
It includes a breakdown of the performance of the model for each vulnerability category.
Example Findings
Meta puts a lot of work into making their models safe, but it's hard to test for everything and smaller models tend to have more issues.
Here are some common issues you might find when red teaming Llama models:
-
Prompt Injection: Llama models can be vulnerable to injected instructions that override their original behavior.
-
Harmful Content: The model may generate harmful content when prompted with adversarial inputs.
-
Hallucination: The model might confidently state incorrect information.
-
PII Handling: The model could inappropriately handle or disclose personal information.
Mitigating Vulnerabilities
Remeditations will depend on your test results, but in general some things to keep in mind are:
- System Prompts: Add explicit safety constraints in your system prompts
- Input Validation: Implement pre-processing to catch malicious inputs
- Output Filtering: Add post-processing to filter harmful content
- Temperature Adjustment: Lower temperature values can reduce erratic behavior