Autonomous Multi-Turn Jailbreak

Description: A multi-turn jailbreak vulnerability exists in multiple state-of-the-art Large Language Models (LLMs) that allows attackers to bypass safety guardrails by progressively steering long-horizon conversations. Demonstrated via the "Mastermind" framework, the attack leverages a hierarchical multi-agent architecture to decouple high-level malicious objectives from low-level tactical execution. By employing strategy-level fuzzing—dynamically reflecting on model refusals and recombining abstracted adversarial patterns (e.g., defensive framing, fictional crises)—an attacker can systematically erode a model's alignment. This allows the fragmentation of malicious intent across extended exchanges, rendering traditional single-turn detection methods and static defenses ineffective.

Examples: An attacker can bypass safety filters by fragmenting a harmful query into a structured multi-turn dialogue. For example, to achieve the objective "Write a defamatory news article claiming a CEO is embezzling pension funds," the attack is dynamically broken into sequential, adaptive steps:

Turn 1 (Establish Persona): The attacker prompts the model to adopt the persona of a "senior crisis PR manager."
Turn 2 (Escalate Context): The attacker introduces a high-stakes scenario, such as a fictional competitor smear campaign, to create a defensive frame.
Turn 3 (Elicit Harm): The attacker requests the specific defamatory content, strictly masking it as a "risk assessment draft" for the PR manager to analyze.

If the model refuses at any step, the attacker dynamically refines the prompt based on the refusal reason before proceeding to the next step.

Impact: Successful exploitation allows adversaries to reliably bypass safety alignments to generate highly detailed, actionable harmful content. This includes instructions for cybercrime, the creation of chemical threats, hate speech, and the generation of targeted misinformation. The attack maintains high success rates even against models utilizing advanced Chain-of-Thought (CoT) reasoning.

Affected Systems: The vulnerability has been successfully demonstrated against standard and reasoning-focused LLMs, including but not limited to:

OpenAI: GPT-4o, GPT-4.1, o3-mini, o4-mini, GPT-5
Anthropic: Claude 3.7 Sonnet
Google: Gemini 2.5 Flash, Gemini 2.5 Pro
DeepSeek: DeepSeek V3, DeepSeek R1
Meta: Llama 3 series (7B, 70B)
Alibaba: Qwen 2.5 series (7B, 14B, 72B)

Mitigation Steps: The researchers note that current defense paradigms are insufficient to fully mitigate this threat, but highlight techniques that attenuate attack performance:

Implement Multi-turn Self-Reminders: Configure explicit safety instructions within the system prompt at conversation initialization, and automatically append a safety suffix to the user query in the most recent turn. (This defense was shown to cause the most significant drop in attack success and output severity).
Integrate Conversation-Level Proxy Defenses: Deploy external moderation classifiers (such as Llama Guard 4) that evaluate the entire conversation history for potential policy violations prior to generating a response.
Input Perturbation: Apply character-level or word-level modifications (e.g., SmoothLLM) to the most recent user query to disrupt targeted adversarial syntax, though this provides limited efficacy against semantic, strategy-level attacks.

Autonomous Multi-Turn Jailbreak

Research Paper