LMVD-ID: a6e338cf
Published December 1, 2025

RL Multi-Turn Jailbreak

Affected Models: GPT-4o, Llama 2 13B, Llama 3.1 8B, Mistral 7B, Qwen 2.5 3B, Gemma 2 9B

Research Paper

RL-MTJail: Reinforcement Learning for Automated Black-Box Multi-Turn Jailbreaking of Large Language Models


Description: Large Language Models (LLMs) are vulnerable to automated, black-box, multi-turn jailbreak attacks coordinated by an adversarial agent trained via Reinforcement Learning (RL). The vulnerability exists because standard safety alignment typically optimizes for single-turn refusals and fails to account for trajectory-level planning. The attack method, dubbed RL-MTJail, uses a multi-turn variant of Group Relative Policy Optimization (GRPO) to train an attacker LLM. The agent optimizes for the harmfulness of the final response in a conversation and is shaped by two heuristic process rewards: (1) over-harm mitigation, which penalizes intermediate prompts that trigger an immediate refusal (thereby keeping the victim model compliant during the setup phase), and (2) target-guided progression, which rewards gradual semantic convergence toward the prohibited topic. This allows the attacker to distribute malicious intent across multiple turns, effectively bypassing safety guardrails in models such as Llama-3, Qwen2.5, and Gemma-2.
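To make the reward structure concrete, the following is a minimal sketch of how the outcome reward and the two process rewards described above could be combined. The helper functions harmfulness_score, is_refusal, and topic_similarity, as well as the weighting coefficients, are illustrative assumptions and are not taken from the paper or its code.

```python
# Hypothetical sketch of the trajectory-level reward shaping described above.
# The scoring helpers below are placeholders (a real setup would use a judge
# model and embedding similarity); they are not the paper's implementation.

def is_refusal(response: str) -> bool:
    # Placeholder: keyword heuristic standing in for a refusal classifier.
    return any(p in response.lower() for p in ("i can't", "i cannot", "i won't"))

def harmfulness_score(response: str) -> float:
    # Placeholder for a judge model returning a harmfulness score in [0, 1].
    return 0.0

def topic_similarity(prompt: str, target_topic: str) -> float:
    # Placeholder for embedding-based semantic similarity in [0, 1].
    return 0.0

def trajectory_reward(prompts, responses, target_topic,
                      lambda_refusal=1.0, lambda_progress=0.5):
    """Combine the outcome reward with the two heuristic process rewards."""
    # Outcome reward: harmfulness of the final victim response.
    outcome = harmfulness_score(responses[-1])

    # Process reward 1: over-harm mitigation -- penalize intermediate prompts
    # that trigger an immediate refusal, keeping the victim compliant early on.
    refusal_penalty = sum(
        1.0 for r in responses[:-1] if is_refusal(r)
    ) / max(len(responses) - 1, 1)

    # Process reward 2: target-guided progression -- reward gradual semantic
    # convergence of successive prompts toward the prohibited topic.
    sims = [topic_similarity(p, target_topic) for p in prompts]
    progression = sum(
        max(sims[t] - sims[t - 1], 0.0) for t in range(1, len(sims))
    )

    return outcome - lambda_refusal * refusal_penalty + lambda_progress * progression
```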

Examples: The paper does not reproduce specific dialogue transcripts, as the harmful content is redacted in the source figures. The attack consists of an iterative conversation in which the attacker agent generates each prompt $x_t$ from the conversation history $\tau_{t-1}$ in order to elicit a final harmful response $y$.
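The multi-turn interaction can be pictured as a simple rollout loop. The sketch below assumes hypothetical attacker.generate and victim.chat interfaces; it is not the repository's actual API.

```python
# Hypothetical rollout loop for one multi-turn attack episode.
# `attacker` and `victim` stand in for the RL-trained attacker policy and the
# black-box victim model; neither interface is taken from the official code.

def run_episode(attacker, victim, harmful_goal: str, max_turns: int = 5):
    history = []  # trajectory tau: list of (prompt, response) pairs
    for t in range(max_turns):
        # The attacker conditions on the goal and the history so far
        # (tau_{t-1}) to produce the next prompt x_t.
        prompt = attacker.generate(goal=harmful_goal, history=history)
        # Black-box query to the victim model.
        response = victim.chat(history=history, prompt=prompt)
        history.append((prompt, response))
    # The final response y is what the outcome reward is computed on.
    return history
```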

For implementation code and dataset details used to reproduce the attack, see the official repository: https://github.com/xxiqiao/RL-MTJail

Impact:

  • Safety Bypass: Circumvention of safety mechanisms (including refusals and guardrails) in state-of-the-art aligned LLMs.
  • Content Generation: Generation of harmful, toxic, or illegal content (e.g., hate speech, malware creation instructions, illegal acts) that the model is aligned to refuse.
  • Transferability: Attack policies trained on one victim model (e.g., Qwen2.5) successfully transfer to other, more robust models (e.g., Llama-3.1, Gemma-2), indicating a generalized vulnerability in current LLM safety training.

Affected Systems: The vulnerability has been confirmed on the following models when served as black-box APIs or standalone instances:

  • Qwen2.5-7B-Instruct
  • Llama-3.1-8B-Instruct
  • Gemma-2-9B-IT
  • Mistral-7B-Instruct-v0.3

Mitigation Steps:

  • Multi-Agent Co-Training: Deploy adaptive or adversarially trained safety modules that create non-stationary environments, disrupting the static policy optimization of the attacker.
  • Trajectory-Level Evaluation: Implement safety guardrails that evaluate the cumulative semantic drift and harmfulness of the entire conversation history, rather than evaluating prompt-response pairs in isolation (a rough sketch follows this list).
  • Adversarial Defenses: Incorporate multi-turn adversarial examples (specifically those generated by RL agents) into the safety alignment training data to improve robustness against long-term strategy planning.
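As a rough illustration of the trajectory-level evaluation idea, the sketch below scores the whole conversation rather than each turn in isolation. The moderation_score helper and both thresholds are illustrative assumptions, not a production guardrail.

```python
# Illustrative trajectory-level guardrail: flag a conversation when the
# cumulative harmfulness of the history crosses a threshold, even if no single
# prompt-response pair does. `moderation_score` is an assumed helper and the
# thresholds are arbitrary examples.

def moderation_score(text: str) -> float:
    # Placeholder for a real moderation/judge model returning a score in [0, 1].
    return 0.0

def conversation_is_unsafe(history, per_turn_threshold=0.8,
                           cumulative_threshold=1.5):
    """history: list of (prompt, response) pairs for the conversation so far."""
    scores = [moderation_score(p) + moderation_score(r) for p, r in history]
    # Per-turn check (what single-turn guardrails already do).
    if any(s >= per_turn_threshold for s in scores):
        return True
    # Trajectory-level check: intent distributed across turns still adds up.
    return sum(scores) >= cumulative_threshold
```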

© 2026 Promptfoo. All rights reserved.