Skip to main content

Guardrails

Use the guardrails assert type to ensure that LLM outputs pass safety checks based on the provider's built-in guardrails.

This assertion checks both input and output content against provider guardrails. Input guardrails typically detect prompt injections and jailbreak attempts, while output guardrails check for harmful content categories like hate speech, violence, or inappropriate material based on your guardrails configuration. The assertion verifies that neither the input nor output have been flagged for safety concerns.

Provider Support

The guardrails assertion is currently supported on:

Other providers do not currently support this assertion type. The assertion will pass with a score of 0 for unsupported providers.

note

If you are using Promptfoo's built-in Azure OpenAI (with Content Filters) or AWS Bedrock (with Amazon Guardrails) providers, Promptfoo automatically maps provider responses to the top-level guardrails object. You do not need to implement a response transform for these built-in integrations. The mapping guidance below is only necessary for custom HTTP targets or other non-built-in providers.

Basic Usage

Here's a basic example of using the guardrail assertion:

tests:
- vars:
prompt: 'Your test prompt'
assert:
- type: guardrails

You can also set it as a default test assertion:

defaultTest:
assert:
- type: guardrails
note

Pass/fail logic of the assertion:

  • If the provider's guardrails blocks the content, the assertion fails (indicating content was blocked)
  • If the guardrails passes the content, the assertion passes (indicating content was not blocked)
note

For Azure, if the prompt fails the input content safety filter, the response status is 400 with code content_filter. In this case, the guardrails assertion passes.

Red Team Configuration

When using guardrails assertions for red teaming scenarios, you should specify the guardrails property:

assert:
- type: guardrails
config:
purpose: redteam
note

This changes the pass/fail logic of the assertion:

  • If the provider's guardrails blocks the content, the test passes (indicating the attack was successfully blocked)
  • If the guardrails passes the content, the assertion doesn't impact the final test result (the test will be graded based on other assertions)

How it works

The guardrails assertion checks for:

  • Input safety
  • Output safety

The assertion will:

  • Pass (score: 1) if the content passes all safety checks
  • Fail (score: 0) if either the input or output is flagged
  • Pass with score 0 if no guardrails was applied

When content is flagged, the assertion provides specific feedback about whether it was the input or output that failed the safety checks.

Mapping provider responses to guardrails

You only need this when you're not using Promptfoo's built-in Azure OpenAI or AWS Bedrock providers. For custom HTTP targets or other non-built-in providers, normalize your provider response into the guardrails shape described below.

In order for this assertion to work, your target's response object must include a top-level guardrails field. The assertion reads only the following fields:

  • flagged (boolean)
  • flaggedInput (boolean)
  • flaggedOutput (boolean)
  • reason (string)

Many HTTP or custom targets need a response transform to normalize provider-specific responses into this shape. You can do this by returning an object from your transform with both output and guardrails.

Example: HTTP provider transform (Azure content filters)

The following example shows how to map an Azure OpenAI Content Filter error into the required guardrails object. It uses an HTTP provider with a file-based transformResponse that inspects the JSON body and HTTP status to populate guardrails correctly.

providers:
- id: https
label: azure-gpt
config:
url: https://your-azure-openai-endpoint/openai/deployments/<model>/chat/completions?api-version=2024-02-15-preview
method: POST
headers:
api-key: ${AZURE_OPENAI_API_KEY}
content-type: application/json
body: |
{
"messages": [{"role": "user", "content": "{{prompt}}"}],
"temperature": 0
}
transformResponse: file://./transform-azure-guardrails.js

transform-azure-guardrails.js:

module.exports = (json, text, context) => {
// Default successful shape
const successOutput = json?.choices?.[0]?.message?.content ?? '';

// Azure input content filter case: 400 with code "content_filter"
const status = context?.response?.status;
const errCode = json?.error?.code;
const errMessage = json?.error?.message;

// Build guardrails object when provider indicates filtering
if (status === 400 && errCode === 'content_filter') {
return {
output: errMessage || 'Content filtered by Azure',
guardrails: {
flagged: true,
flaggedInput: true,
flaggedOutput: false,
reason: errMessage || 'Azure content filter detected policy violation',
},
};
}

// Example: map provider header to output filtering signal, if available
const wasFiltered = context?.response?.headers?.['x-content-filtered'] === 'true';
if (wasFiltered) {
return {
output: successOutput,
guardrails: {
flagged: true,
flaggedInput: false,
flaggedOutput: true,
reason: 'Provider flagged completion by content filter',
},
};
}

// Default: pass-through when no guardrails signal present
return {
output: successOutput,
// Omit guardrails or return { flagged: false } to indicate no issues
guardrails: { flagged: false },
};
};

Alternatively, you can use an inline JavaScript transform:

providers:
- id: https
label: azure-gpt
config:
url: https://your-azure-openai-endpoint/openai/deployments/<model>/chat/completions?api-version=2024-02-15-preview
method: POST
headers:
api-key: ${AZURE_OPENAI_API_KEY}
content-type: application/json
body: |
{
"messages": [{"role": "user", "content": "{{prompt}}"}],
"temperature": 0
}
transformResponse: |
(json, text, context) => {
// Default successful shape
const successOutput = json?.choices?.[0]?.message?.content ?? '';

// Azure input content filter case: 400 with code "content_filter"
const status = context?.response?.status;
const errCode = json?.error?.code;
const errMessage = json?.error?.message;

// Build guardrails object when provider indicates filtering
if (status === 400 && errCode === 'content_filter') {
return {
output: errMessage || 'Content filtered by Azure',
guardrails: {
flagged: true,
flaggedInput: true,
flaggedOutput: false,
reason: errMessage || 'Azure content filter detected policy violation',
},
};
}

// Example: map provider header to output filtering signal, if available
const wasFiltered = context?.response?.headers?.['x-content-filtered'] === 'true';
if (wasFiltered) {
return {
output: successOutput,
guardrails: {
flagged: true,
flaggedInput: false,
flaggedOutput: true,
reason: 'Provider flagged completion by content filter',
},
};
}

// Default: pass-through when no guardrails signal present
return {
output: successOutput,
guardrails: { flagged: false },
};
}

Notes:

  • The transform must return an object with output and guardrails at the top level.
  • The guardrails object should reflect whether the input or output was flagged (flaggedInput, flaggedOutput) and include a human-readable reason.
  • For Azure, failed input content safety checks typically return HTTP 400 with code content_filter. In this case, set flagged: true and flaggedInput: true and populate reason from the error.
  • You can also derive guardrail flags from response headers or other metadata available in context.response.

See also: