Pervasive Multi-turn Jailbreaks
Research Paper
Death by a Thousand Prompts: Open Model Vulnerability Analysis
Description: Multiple open-weight Large Language Models (LLMs), particularly those prioritizing capability over safety alignment, exhibit a critical vulnerability to adaptive multi-turn prompt-injection and jailbreak attacks. While these models reliably reject isolated, single-turn adversarial inputs (an average Attack Success Rate of ~13.11%), they fail to maintain safety guardrails and policy enforcement across extended conversational contexts. By leveraging iterative strategies such as "Crescendo" (gradual escalation), "Contextual Ambiguity," and "Role-Play," attackers can bypass safety filters. In automated testing, multi-turn Attack Success Rates (ASR) rose by 2x to 10x over single-turn baselines, reaching 92.78% on Mistral Large-2 and 86.18% on Qwen3-32B. The vulnerability stems from the models' inability to retain a firm refusal state or to detect intent drift over long context windows.
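For reference, the ASR figures above are simple proportions: successful attack conversations divided by total attempts, computed per model and per mode (single-turn vs. multi-turn). A minimal sketch of the arithmetic, using made-up counts rather than the paper's raw data:

```python
# ASR = successful attacks / total attempts, computed per evaluation mode.
# The counts below are illustrative only, not the paper's raw data.
def asr(successes: int, attempts: int) -> float:
    return successes / attempts

single_turn = asr(66, 499)   # ~13.2%, close to the reported ~13.11% average
multi_turn = asr(463, 499)   # ~92.8%, close to the reported Mistral Large-2 figure

print(f"single-turn ASR: {single_turn:.2%}")
print(f"multi-turn ASR:  {multi_turn:.2%}")
print(f"increase: {multi_turn / single_turn:.1f}x")  # falls in the reported 2x-10x range
```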
Examples: Specific verbatim prompt logs were not disclosed in the text, but the vulnerability is reproducible using the following adaptive strategies described in the assessment methodology:
- Crescendo Attack:
- Method: The attacker begins with benign queries to establish trust and context.
- Execution: The attacker requests "harmless" coding segments or fictional scenarios. Over 5-10 turns, the attacker subtly introduces adversarial elements, refining the prompt based on previous model responses until the model outputs a full malicious script or prohibited content.
- Reference: See "Analysis" section regarding the evolution from harmless code to malicious scripts.
- Strategy Implementation:
The vulnerability is triggered via the Cisco AI Validation methodology or a similar automated red-teaming framework, in which an "Attacker Model" (e.g., GPT-3.5 Turbo) is configured with the following strategies (a minimal sketch of the attack loop appears after this list):
- Information Decomposition and Reassembly
- Role-Play
- Refusal Reframe
- See "Section 4.2 Multi-Turn Evaluation" for parameter details.
Impact:
- Safety Guardrail Bypass: Attackers can force the model to generate content explicitly prohibited by its safety training, including hate speech, misinformation, and sexually explicit text.
- Malicious Code Generation: High-capability models can be steered to generate functional malware or exploit scripts that are refused in single-turn contexts.
- Contextual Manipulation: In RAG (Retrieval-Augmented Generation) or agentic deployments, this vulnerability allows attackers to manipulate the model's context to exfiltrate data or perform unauthorized actions via tool-use capabilities.
Affected Systems: The vulnerability was confirmed in the following open-weight models (specific versions tested):
- Mistral: Large-2 (Large-Instruct-2407) - 92.78% Multi-turn ASR
- Alibaba: Qwen3-32B - 86.18% Multi-turn ASR
- Meta: Llama 3.3-70B-Instruct
- DeepSeek: v3.1
- Microsoft: Phi-4
- Zhipu AI: GLM 4.5-Air
- OpenAI: GPT-OSS-20b
- Google: Gemma 3-1B-IT (Lowest susceptibility, but still affected)
Mitigation Steps:
- Context-Aware Guardrails: Implement third-party, model-agnostic runtime guardrails that monitor the full conversation history (inputs and outputs) rather than isolated prompts in order to detect intent drift (a minimal sketch follows this list).
- System Prompt Hardening: Deploy strict, use-case-specific meta-prompts that explicitly prevent user instructions from overriding core safety rules, regardless of conversation length.
- Adversarial Fine-tuning: Perform fine-tuning on datasets containing multi-turn adversarial examples to teach the model to maintain refusals across extended interactions.
- Rate Limiting and Monitoring: Implement rate limiting to disrupt automated iterative attacks and log all inputs/responses for anomaly detection.
- Input/Output Filtering: Do not rely solely on the model's internal alignment; use external filtering mechanisms to reject off-topic or policy-violating requests before they reach the model.
- Scope Restriction: Strictly scope the model's integration with downstream plugins or automation tools to prevent high-impact consequences if the model is jailbroken.
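The sketch below illustrates the first mitigation: a model-agnostic check that evaluates the entire transcript before each target-model call, instead of screening only the latest prompt. The judge model name, judge prompt, and drift threshold are illustrative assumptions, not a specific vendor's API.

```python
# Minimal sketch of a context-aware guardrail that reviews the *entire*
# conversation before each target-model call. Judge model, prompt wording,
# and threshold are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()
GUARDRAIL_MODEL = "gpt-4o-mini"  # any capable judge model; placeholder choice

JUDGE_PROMPT = (
    "You review chat transcripts for safety. Given the full conversation, decide "
    "whether the user's cumulative intent is drifting toward prohibited content "
    "(malware, hate speech, etc.), even if each individual message looks benign. "
    'Reply with JSON: {"drift_score": <0-1>, "reason": "<short explanation>"}.'
)

def transcript(history: list[dict]) -> str:
    return "\n".join(f"{m['role'].upper()}: {m['content']}" for m in history)

def intent_drift(history: list[dict]) -> dict:
    resp = client.chat.completions.create(
        model=GUARDRAIL_MODEL,
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": transcript(history)},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

def guarded_reply(history: list[dict], target_call, threshold: float = 0.6) -> str:
    # Block the request before it reaches the target model if the judged
    # cumulative intent crosses the threshold.
    verdict = intent_drift(history)
    if verdict["drift_score"] >= threshold:
        return "This request conflicts with the conversation's usage policy."
    return target_call(history)
```

Because the judge sees the whole transcript rather than a single message, a Crescendo sequence of individually benign prompts can still trip the threshold once the cumulative intent becomes clear.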