Pervasive Multi-turn Jailbreaks
Research Paper
Death by a Thousand Prompts: Open Model Vulnerability Analysis
Description: Multiple open-weight Large Language Models (LLMs), particularly those prioritizing capability over safety alignment, exhibit a critical vulnerability to adaptive multi-turn prompt-injection and jailbreak attacks. While these models reliably reject isolated, single-turn adversarial inputs (an average Attack Success Rate of ~13.11%), they fail to maintain safety guardrails and policy enforcement across extended conversational contexts. By leveraging iterative strategies such as "Crescendo" (gradual escalation), "Contextual Ambiguity," and "Role-Play," attackers can bypass safety filters. In automated testing, multi-turn Attack Success Rates (ASR) rose by 2x to 10x over single-turn baselines, reaching 92.78% on Mistral Large-2 and 86.18% on Qwen3-32B. The vulnerability stems from the models' inability to retain a firm refusal state or to detect intent drift over long context windows.
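For reference, the ASR figures above are simple proportions: successful attack conversations divided by total attempts, computed per model and per mode (single-turn vs. multi-turn). A minimal sketch of the arithmetic, using made-up counts rather than the paper's raw data:

```python
# ASR = successful attacks / total attempts, computed per evaluation mode.
# The counts below are illustrative only, not the paper's raw data.
def asr(successes: int, attempts: int) -> float:
    return successes / attempts

single_turn = asr(66, 499)   # ~13.2%, close to the reported ~13.11% average
multi_turn = asr(463, 499)   # ~92.8%, close to the reported Mistral Large-2 figure

print(f"single-turn ASR: {single_turn:.2%}")
print(f"multi-turn ASR:  {multi_turn:.2%}")
print(f"increase: {multi_turn / single_turn:.1f}x")  # falls in the reported 2x-10x range
```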
Examples: Specific verbatim prompt logs were not disclosed in the text, but the vulnerability is reproducible using the following adaptive strategies described in the assessment methodology:
- Crescendo Attack:
- Method: The attacker begins with benign queries to establish trust and context.
- Execution: The attacker requests "harmless" coding segments or fictional scenarios. Over 5-10 turns, the attacker subtly introduces adversarial elements, refining the prompt based on previous model responses until the model outputs a full malicious script or prohibited content.
- Reference: See "Analysis" section regarding the evolution from harmless code to malicious scripts.
- Strategy Implementation:
The vulnerability is triggered via the Cisco AI Validation methodology or a similar automated red-teaming framework, in which an "Attacker Model" (e.g., GPT-3.5 Turbo) is configured with the following strategies (a minimal sketch of the attack loop appears after this list):
- Information Decomposition and Reassembly
- Role-Play
- Refusal Reframe
- See "Section 4.2 Multi-Turn Evaluation" for parameter details.
Impact:
- Safety Guardrail Bypass: Attackers can force the model to generate content explicitly prohibited by its safety training, including hate speech, misinformation, and sexually explicit text.
- Malicious Code Generation: High-capability models can be steered to generate functional malware or exploit scripts that are refused in single-turn contexts.
- Contextual Manipulation: In RAG (Retrieval-Augmented Generation) or agentic deployments, this vulnerability allows attackers to manipulate the model's context to exfiltrate data or perform unauthorized actions via tool-use capabilities.
Affected Systems: The vulnerability was confirmed in the following open-weight models (specific versions tested):
- Mistral: Large-2 (Large-Instruct-2407) - 92.78% Multi-turn ASR
- Alibaba: Qwen3-32B - 86.18% Multi-turn ASR
- Meta: Llama 3.3-70B-Instruct
- DeepSeek: v3.1
- Microsoft: Phi-4
- Zhipu AI: GLM 4.5-Air
- OpenAI: GPT-OSS-20b
- Google: Gemma 3-1B-IT (Lowest susceptibility, but still affected)
Mitigation Steps:
- Context-Aware Guardrails: Implement third-party, model-agnostic runtime guardrails that monitor the full conversation history (inputs and outputs) rather than isolated prompts in order to detect intent drift (a minimal sketch follows this list).
- System Prompt Hardening: Deploy strict, use-case-specific meta-prompts that explicitly prevent user instructions from overriding core safety rules, regardless of conversation length.
- Adversarial Fine-tuning: Perform fine-tuning on datasets containing multi-turn adversarial examples to teach the model to maintain refusals across extended interactions.
- Rate Limiting and Monitoring: Implement rate limiting to disrupt automated iterative attacks and log all inputs/responses for anomaly detection.
- Input/Output Filtering: Do not rely solely on the model's internal alignment; use external filtering mechanisms to reject off-topic or policy-violating requests before they reach the model.
- Scope Restriction: Strictly scope the model's integration with downstream plugins or automation tools to prevent high-impact consequences if the model is jailbroken.
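The sketch below illustrates the first mitigation: a model-agnostic check that evaluates the entire transcript before each target-model call, instead of screening only the latest prompt. The judge model name, judge prompt, and drift threshold are illustrative assumptions, not a specific vendor's API.

```python
# Minimal sketch of a context-aware guardrail that reviews the *entire*
# conversation before each target-model call. Judge model, prompt wording,
# and threshold are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()
GUARDRAIL_MODEL = "gpt-4o-mini"  # any capable judge model; placeholder choice

JUDGE_PROMPT = (
    "You review chat transcripts for safety. Given the full conversation, decide "
    "whether the user's cumulative intent is drifting toward prohibited content "
    "(malware, hate speech, etc.), even if each individual message looks benign. "
    'Reply with JSON: {"drift_score": <0-1>, "reason": "<short explanation>"}.'
)

def transcript(history: list[dict]) -> str:
    return "\n".join(f"{m['role'].upper()}: {m['content']}" for m in history)

def intent_drift(history: list[dict]) -> dict:
    resp = client.chat.completions.create(
        model=GUARDRAIL_MODEL,
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": transcript(history)},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

def guarded_reply(history: list[dict], target_call, threshold: float = 0.6) -> str:
    # Block the request before it reaches the target model if the judged
    # cumulative intent crosses the threshold.
    verdict = intent_drift(history)
    if verdict["drift_score"] >= threshold:
        return "This request conflicts with the conversation's usage policy."
    return target_call(history)
```

Because the judge sees the whole transcript rather than a single message, a Crescendo sequence of individually benign prompts can still trip the threshold once the cumulative intent becomes clear.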