Jailbreak Vulnerabilities

Methods for bypassing model safety measures

Related Vulnerabilities

386 entries

Code Agent Executable Jailbreaks

10/13/2025

AI code agents are vulnerable to jailbreaking attacks that cause them to generate or complete malicious code. The vulnerability is significantly amplified when a base Large Language Model (LLM) is integrated into an agentic framework that uses multi-step planning and tool-use. Initial safety refusals by the LLM are frequently overturned during subsequent planning or self-correction steps within the agent's reasoning loop.

Breaking the Code: Security Assessment of AI Code Agents Through Systematic Jailbreaking Attacks

Affects: gpt-4.1, gpt-o1, deepseek-r1, qwen3-235b, mistral large 2.1, llama-3.1-70b, llama-3-8b, claude-3.7-sonnet, dolphinmistral-24b-venice

Overfitting-induced Benign Jailbreak

10/13/2025

A vulnerability exists in Large Language Models (LLMs) that support fine-tuning, allowing an attacker to bypass safety alignments using a small, benign dataset. The attack, "Attack via Overfitting," is a two-stage process. In Stage 1, the model is fine-tuned on a small set of benign questions (e.g., 10) paired with identical, repetitive refusal answers. This induces an overfitted state where the model learns to refuse all prompts, creating a sharp minimum in the loss landscape and making it highly sensitive to parameter changes. In Stage 2, the overfitted model is further fine-tuned on the same benign questions, but with their standard, helpful answers. This second fine-tuning step causes catastrophic forgetting of the general refusal behavior, leading to a collapse of safety alignment and causing the model to comply with harmful and malicious instructions. The attack is highly stealthy as the fine-tuning data appears benign to content moderation systems.

Attack via Overfitting: 10-shot Benign Fine-tuning to Jailbreak LLMs

Affects: llama2-7b-chat-hf, llama3-8b-instruct, deepseek-r1-distill-llama3-8b, qwen2.5-7b-instruct, qwen3-8b, gpt-4o-mini, gpt-4.1-mini, gpt-3.5-turbo, gpt-4o, gpt-4.1

Pattern Enhanced Multi-Turn Jailbreaking

11/1/2025

Pattern Enhanced Multi-Turn Jailbreaking: Exploiting Structural Vulnerabilities in Large Language Models

Affects: claude-3-haiku, gemini-1.5-flash, gemini-1.5-pro, gemini-2.0-flash, gpt-4o-mini, gpt-3.5-turbo, llama2-7b, llama2-13b, llama3-8b, deepseek-chat, mistral-7b-instruct-v0.3, vicuna-13b-v1.5-16k

Persuasive Jailbreak Fingerprint

11/1/2025

Large Language Models (LLMs) are vulnerable to jailbreak attacks that use persuasive techniques grounded in social psychology to bypass safety alignments. Malicious instructions can be reframed using one of Cialdini's seven principles of persuasion (Authority, Reciprocity, Commitment, Social Proof, Liking, Scarcity, and Unity). These rephrased prompts, which remain human-readable and can be generated automatically, manipulate the LLM into complying with harmful requests it would otherwise refuse. The attack's effectiveness varies by principle and by model, revealing distinct "persuasive fingerprints" of susceptibility.

Uncovering the Persuasive Fingerprint of LLMs in Jailbreaking Attacks

Affects: wizardlm-uncensored2, vicuna, llama2, llama3, gemma3, deepseek-r1, phi-4, gpt-2

Schema Exploitation Jailbreak

10/31/2025

A vulnerability exists in Large Language Models where their strong adherence to processing structured data schemas can be exploited to bypass safety mechanisms. The attack, named BreakFun, uses a multi-component prompt that combines an innocent framing, a Chain-of-Thought (CoT) instruction, and a core "Trojan Schema." This schema is an adversarially designed data structure (e.g., a Python class definition) that embeds a harmful user request. By instructing the model to simulate the hypothetical output of code that uses this schema, the model's cognitive resources are misdirected towards fulfilling the structural and syntactic requirements of the task, causing it to overlook and comply with the embedded harmful request.

BreakFun: Jailbreaking LLMs via Schema Exploitation

Affects: gpt-4.1 mini, gemini 2.5 flash, claude-3.5 sonnet, claude-3 haiku, kimi-k2, ernie-4.5, gpt-oss 20b, deepseek-r1, gemma3 12b, qwen3 8b, llama 3.1 8b, mistral 7b, zephyr 7b, qwen3-max

Special Token Jailbreak

10/31/2025

Large Language Models (LLMs) that use special tokens to define conversational structure (e.g., via chat templates) are vulnerable to a jailbreak attack named MetaBreak. An attacker can inject these special tokens, or regular tokens with high semantic similarity in the embedding space, into a user prompt. This manipulation allows the attacker to bypass the model's internal safety alignment and external content moderation systems. The attack leverages four primitives:

Response Injection: Forging an assistant's turn within the user prompt to trick the model into believing it has already started to provide an affirmative response.
Turn Masking: Using a few-shot, word-by-word construction to make the injected response resilient to disruption from platform-added chat template wrappers.
Input Segmentation: Splitting sensitive keywords with injected special tokens to evade detection by content moderators, which may fail to reconstruct the original term, while the more capable target LLM can.
Semantic Mimicry: Bypassing special token sanitization defenses by substituting them with regular tokens that have a minimal L2 norm distance in the embedding space, thereby preserving the token's structural function.

MetaBreak: Jailbreaking Online LLM Services via Special Token Manipulation

Affects: llama-3.3-70b-instruct, qwen-2.5-72b-instruct, gemma-2-27b-instruct, phi4-14b, llamaguard, llama-3.1-405b, llama-3.1-8b, gpt-4.1, claude-opus-4, llamaguard3-8b, promptguard-86m, shieldgemma2-27b

Untargeted Gradient Jailbreak

10/13/2025

A vulnerability exists in certain safety-aligned Large Language Models (LLMs) due to an untargeted, gradient-based optimization attack method called Untargeted Jailbreak Attack (UJA). Unlike previous targeted attacks (e.g., GCG) that optimize a prompt to elicit a predefined string (e.g., "Sure, here is..."), UJA optimizes for a general objective: maximizing the unsafety probability of the model's response, as quantified by an external judge model.

Untargeted Jailbreak Attack

Affects: llama3-8b-instruct, llama3.1-8b-instruct, qwen-2.5-7b-instruct, qwen-3-8b-instruct, vicuna-7b-v1.5, mistral-7b-instruct-v0.3, llama3-70b-instruct, deepseek-r1-671b, roberta, llama-2-13b, gpt-2-large, gpt-4, gpt-4o, llama-guard-3

Camouflaged Jailbreak Prompts Benchmark

10/13/2025

Large Language Models from multiple vendors are vulnerable to a "Camouflaged Jailbreak" attack. Malicious instructions are embedded within seemingly benign, technically complex prompts, often framed as system design or engineering problems. The models fail to recognize the harmful intent implied by the context and technical specifications, bypassing safety filters that rely on detecting explicit keywords. This leads to the generation of detailed, technically plausible instructions for creating dangerous devices or systems. The attack has a high success rate, with models demonstrating full obedience in over 94% of tested cases, treating the harmful requests as legitimate.

Behind the Mask: Benchmarking Camouflaged Jailbreaks in Large Language Models

Affects: gpt-4, gpt-4o, llama-3.1-8b-instruct, gemma-3-4b-it, mistral-7b-instruct-v0.3

Chained Tool-Use Injections

10/13/2025

A vulnerability exists in tool-enabled Large Language Model (LLM) agents, termed Sequential Tool Attack Chaining (STAC), where a sequence of individually benign tool calls can be orchestrated to achieve a malicious outcome. An attacker can guide an agent through a multi-turn interaction, with each step appearing harmless in isolation. Safety mechanisms that evaluate individual prompts or actions fail to detect the threat because the malicious intent is distributed across the sequence and only becomes apparent from the cumulative effect of the entire tool chain, typically at the final execution step. This allows the bypass of safety guardrails to execute harmful actions in the agent's environment.

STAC: When Innocent Tools Form Dangerous Chains to Jailbreak LLM Agents

Affects: gpt-4.1-2025-04-14, gpt-4.1-mini-2025-04-14, qwen3-32b, llama-3.1-405b-instruct, llama3.3-70b-instruct, mistral-large-instruct-2411, mistral-small-3.2-24b-instruct-2506, magistral-small-2506, gpt-4.1

Content Concretization Jailbreak

9/30/2025

A vulnerability, termed "Content Concretization," exists in Large Language Models (LLMs) wherein safety filters can be bypassed by iteratively refining a malicious request. The attack uses a less-constrained, lower-tier LLM to generate a preliminary draft (e.g., pseudocode or a non-executable prototype) of a malicious tool from an abstract prompt. This "concretized" draft is then passed to a more capable, higher-tier LLM. The higher-tier LLM, when prompted to refine or complete the existing draft, is significantly more likely to generate the full malicious, executable content than if it had received the initial abstract prompt directly. This exploits a weakness in safety alignment where models are more permissive in extending existing content compared to generating harmful content from scratch.

Jailbreaking Large Language Models Through Content Concretization

Affects: gpt-4o-mini, claude 3.7 sonnet, claude 3.5 sonnet, claude 3.5 haiku, gemini 2.5 flash preview, gemini 2.5 pro preview, gemini 2.0 flash, gpt-4.1o3, gpt-4o, gpt-o3, gpt-4

Page 1 of 39