LLM Guardrail Benchmark Lies
Research Paper
When Benchmarks Lie: Evaluating Malicious Prompt Classifiers Under True Distribution Shift
View PaperDescription:
Architectural limitations in Meta's Llama-Prompt-Guard-2-86M and Llama-Guard-3-8B cause them to fail at detecting indirect prompt injections and agentic tool-use attacks, with detection rates dropping as low as 7-37%. Llama-Guard-3-8B enforces strict user/assistant message alternation and lacks support for tool-use roles; attempting to process messages with role: "tool" or role: "ipython" causes the chat template to raise an error, preventing evaluation entirely. PromptGuard 2 operates strictly on raw text without chat template support, blinding it to structural message boundaries and tool provenance. Consequently, attackers can bypass these safety guardrails and hijack LLM agents by embedding malicious instructions within external documents, API responses, or tool outputs.
Examples:
- LlamaGuard Tool Role Error: When an LLM agent processes a tool response containing a malicious payload (e.g., from the InjecAgent dataset), the message structure includes
role: "tool"orrole: "ipython". Passing this conversation directly to LlamaGuard for safety evaluation causes a chat template application failure because LlamaGuard's template strictly expects alternatinguserandassistantroles and silently ignores thetoolsparameter, resulting in an unhandled exception or bypassed check. - PromptGuard Context Flattening Bypass: Because PromptGuard 2 lacks chat template support, multi-turn agentic workflows must be concatenated into raw text (e.g.,
system: [...] user: [...] assistant: [...]). Attackers can exploit this flat structure by injecting role-playing headers or context-switching prompts inside retrieved text or tool responses (e.g., BIPIA dataset indirect injections). PromptGuard evaluates the flattened string without understanding the structural boundary between trusted instructions and untrusted external data, leading to a failure to detect the injection. - See the InjecAgent (arXiv:2403.02691) and BIPIA (arXiv:2501.00288) datasets for reproducible examples of tool-use and indirect prompt injection attacks that bypass these guardrails.
Impact: Attackers can successfully execute indirect prompt injections and hijack agentic workflows. By placing malicious commands inside tool outputs, emails, or retrieved documents, an attacker can manipulate the LLM agent into executing unauthorized actions, exfiltrating data, or bypassing intended system restrictions without triggering the deployment's safety guardrails.
Affected Systems:
- Meta Llama-Guard-3-8B (when used in agentic/tool-use workflows)
- Meta Llama-Prompt-Guard-2-86M
- Agentic LLM pipelines relying on these conversational guardrails for input/output sanitization.
Mitigation Steps:
- Implement Activation-Based Probes: Extract activations from the underlying LLM's residual stream (e.g., layer 31, at the last token of the user message before the generation prompt) and use linear probes (logistic regression) to classify malicious inputs. This natively incorporates the model's chat template and tool schema understanding, achieving significantly higher detection rates (up to 99%) on indirect and tool-use injections.
- Adopt Leave-One-Dataset-Out (LODO) Evaluation: When training or evaluating prompt attack classifiers, hold out entire datasets during training rather than using standard train-test splits from the same sources. This prevents the classifier from learning dataset-specific shortcuts (e.g., specific email formatting) and accurately measures true out-of-distribution generalization.
- Filter Dataset Shortcuts for Interpretability: If using Sparse Autoencoder (SAE) features for safety classification or interpretability, weight the feature explanations by their LODO coefficient retention scores to filter out context-dependent dataset artifacts and surface genuinely predictive attack features.
- Calibrate Thresholds per Target Distribution: Recognize that calibration is distribution-dependent. Do not rely on standard classification thresholds (e.g., t=0.5); instead, optimize operating points on held-out data representative of the specific deployment environment.
© 2026 Promptfoo. All rights reserved.