LMVD-ID: 13c741bf
Published November 1, 2025

Pervasive Multi-turn Jailbreaks

Affected Models:GPT-4o 20B, Llama 3.3 70B, Mistral Large, DeepSeek-V3, Qwen 2.5 32B, Gemma 1B, Phi-4

Research Paper

Death by a Thousand Prompts: Open Model Vulnerability Analysis

View Paper

Description: Multiple open-weight Large Language Models (LLMs)—specifically those prioritizing capability over safety alignment—exhibit a critical vulnerability to adaptive multi-turn prompt injection and jailbreak attacks. While these models effectively reject isolated, single-turn adversarial inputs (averaging ~13.11% Attack Success Rate), they fail to maintain safety guardrails and policy enforcement across extended conversational contexts. By leveraging iterative strategies such as "Crescendo" (gradual escalation), "Contextual Ambiguity," and "Role-Play," attackers can bypass safety filters. In automated testing, this vulnerability resulted in Attack Success Rates (ASR) increasing by 2x to 10x, reaching up to 92.78% in Mistral Large-2 and 86.18% in Qwen3-32B. The vulnerability stems from the models' inability to retain forceful rejection states or detect intent drift over long context windows.

Examples: Specific verbatim prompt logs were not disclosed in the text, but the vulnerability is reproducible using the following adaptive strategies described in the assessment methodology:

  1. Crescendo Attack:
  • Method: The attacker begins with benign queries to establish trust and context.
  • Execution: The attacker requests "harmless" coding segments or fictional scenarios. Over 5-10 turns, the attacker subtly introduces adversarial elements, refining the prompt based on previous model responses until the model outputs a full malicious script or prohibited content.
  • Reference: See "Analysis" section regarding the evolution from harmless code to malicious scripts.
  1. Strategy Implementation:
  • The vulnerability is triggered using the Cisco AI Validation methodology or similar automated red-teaming frameworks using an "Attacker Model" (e.g., GPT-3.5 Turbo) configured with the following specific strategies:
    • Information Decomposition and Reassembly
    • Role-Play
    • Refusal Reframe
  • See "Section 4.2 Multi-Turn Evaluation" for parameter details.

Impact:

  • Safety Guardrail Bypass: Attackers can force the model to generate content explicitly prohibited by its safety training, including hate speech, misinformation, and sexually explicit text.
  • Malicious Code Generation: High-capability models can be steered to generate functional malware or exploit scripts that are refused in single-turn contexts.
  • Contextual Manipulation: In RAG (Retrieval-Augmented Generation) or agentic deployments, this vulnerability allows attackers to manipulate the model's context to exfiltrate data or perform unauthorized actions via tool-use capabilities.

Affected Systems: The vulnerability was confirmed in the following open-weight models (specific versions tested):

  • Mistral: Large-2 (Large-Instruct-2047) - 92.78% Multi-turn ASR
  • Alibaba: Qwen3-32B - 86.18% Multi-turn ASR
  • Meta: Llama 3.3-70B-Instruct
  • DeepSeek: v3.1
  • Microsoft: Phi-4
  • Zhipu AI: GLM 4.5-Air
  • OpenAI: GPT-OSS-20b
  • Google: Gemma 3-1B-IT (Lowest susceptibility, but still affected)

Mitigation Steps:

  • Context-Aware Guardrails: Implement third-party, model-agnostic runtime guardrails that monitor the full conversation history (input and output) rather than isolated prompts to detect intent drift.
  • System Prompt Hardening: Deploy strict, use-case-specific meta-prompts that explicitly prevent user instructions from overriding core safety rules, regardless of conversation length.
  • Adversarial Fine-tuning: Perform fine-tuning on datasets containing multi-turn adversarial examples to teach the model to maintain refusals across extended interactions.
  • Rate Limiting and Monitoring: Implement rate limiting to disrupt automated iterative attacks and log all inputs/responses for anomaly detection.
  • Input/Output Filtering: Do not rely solely on the model's internal alignment; use external filtering mechanisms to reject off-topic or policy-violating requests before they reach the model.
  • Scope Restriction: Strictly scope the model's integration with downstream plugins or automation tools to prevent high-impact consequences if the model is jailbroken.

© 2026 Promptfoo. All rights reserved.