Jailbreak Vulnerabilities

Methods for bypassing model safety measures

Related Vulnerabilities

364 entries

Automated Red-Teaming Achieves 100% ASR

8/16/2025

Large Language Models (LLMs) are vulnerable to automated adversarial attacks that combine multiple jailbreaking "primitives" into complex prompt chains. A dynamic optimization engine can generate and test billions of unique combinations of techniques (e.g., low-resource language translation, payload splitting, role-playing) to bypass safety guardrails. Unlike manual red-teaming, this combinatorial approach systematically explores the attack surface, achieving near-universal success in eliciting harmful content. The vulnerability lies in the models' inability to maintain safety alignment when faced with a sequence of layered obfuscation and manipulation techniques.
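
As a rough illustration of the combinatorial approach described above, the search can be sketched as a loop over chains of transformation primitives. The sketch below is hypothetical and not taken from the cited report's tooling: apply functions in PRIMITIVES, query_target, and judge_harmful are inert placeholder stubs assumed only for structure, and a real engine would guide the search with a scoring signal rather than brute-force enumeration.

```python
import itertools
from typing import Callable, Iterable, Optional

# Hypothetical primitive registry: each entry maps a technique name to a
# prompt-to-prompt transformation. These are inert placeholders, not working
# obfuscation techniques.
PRIMITIVES: dict[str, Callable[[str], str]] = {
    "low_resource_translation": lambda p: p,  # placeholder: would translate the prompt
    "payload_splitting": lambda p: p,         # placeholder: would split the request into parts
    "role_playing": lambda p: p,              # placeholder: would wrap the prompt in a persona
}

def chain(prompt: str, primitive_names: Iterable[str]) -> str:
    """Apply a sequence of primitives, producing one layered candidate prompt."""
    for name in primitive_names:
        prompt = PRIMITIVES[name](prompt)
    return prompt

def search(seed_prompt: str,
           query_target: Callable[[str], str],
           judge_harmful: Callable[[str], bool],
           max_depth: int = 3) -> Optional[tuple]:
    """Enumerate primitive chains up to max_depth and return the first chain
    whose target response the judge flags, or None if the budget is exhausted."""
    for depth in range(1, max_depth + 1):
        for combo in itertools.permutations(PRIMITIVES, depth):
            candidate = chain(seed_prompt, combo)
            response = query_target(candidate)
            if judge_harmful(response):
                return combo, candidate, response
    return None
```

Under this framing, the attack success rate (ASR) reported in the entry title is the fraction of seed behaviors for which such a search finds a successful chain within its budget.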

LLM Robustness Leaderboard v1 -- Technical report
Affects: deepseek-r1, grok-2, claude-3.5-sonnet, qwen2.5-7b, falcon3-10b, llama-3.2-11b, phi-4, llama-3.2-1b, llama-3.1-405b, llama-3.1-8b, gemma-2-27b-it, llama-3.2-90b, qwen-2.5-72b, llama-3.3-70b, llama-3.1-70b, deepseek-v3, mixtral-8x22b, pixtral-large-2411, granite-3.1-8b, mixtral-8x7b, ministral-8b, mistral-nemo, claude-3.5-sonnet 20241022, o1 2024-12-17, claude-3-haiku, claude-3.5-haiku 20241022, gpt-4o-2024-11-20, o3-mini 2025-01-14, claude-3-opus, gpt-4o-2024-08-06, nova-pro-v1, qwen-max, claude-3-sonnet, gpt-4o-mini-2024-07-18, nova-micro-v1, yi-large, nova-lite-v1, qwen-plus, gemini-pro-1.5, grok-2-1212, falcon3-10b-instruct, llama-3.1-405b-instruct, llama-3.1-70b-instruct, llama-3.1-8b-instruct, llama-3.2-11b-vision-instruct, llama-3.2-1b-instruct, llama-3.2-90b-vision-instruct, llama-3.3-70b-instruct, mixtral-8x22b-instruct, mixtral-8x7b-instruct