Multi-Agent Jailbreak Strategy
Research Paper
X-teaming: Multi-turn jailbreaks and defenses with adaptive multi-agents
Description: A vulnerability exists in multiple LLMs that allows attackers to elicit harmful responses by strategically distributing malicious intent across multiple turns of a conversation. Single-turn safety measures fail to detect the attack because the harmful intent only emerges across a sequence of seemingly benign prompts. The attack is amplified by prompt-optimization techniques that dynamically adjust prompts based on model responses, maximizing the likelihood of eliciting the targeted harmful content.
Examples: See https://x-teaming.github.io/ and https://huggingface.co/datasets/marslabucla/XGuard-Train for multi-turn attacks that successfully elicited harmful responses from various LLMs, including Claude 3.7 Sonnet. Specific examples include generating posts glorifying 9/11 terrorists and describing methods to prevent child laborers from speaking to law enforcement.
Impact: Successful exploitation of this vulnerability can lead to the generation of harmful content, including but not limited to hate speech, misinformation, instructions for illegal activities, and content that could cause psychological harm. The vulnerability significantly undermines the safety and trustworthiness of LLMs.
Affected Systems: Multiple LLMs, including (but not limited to) GPT-4, Claude 3.5 Sonnet, Claude 3.7 Sonnet, Gemini 2.0-Flash, Llama 3-8B-IT, Llama 3-70B-IT, DeepSeek V3, and Qwen-2.5-32B-IT.
Mitigation Steps:
- Implement multi-turn safety mechanisms that analyze the entire conversation history to detect malicious intent distributed across multiple turns.
- Develop robust defenses against prompt optimization techniques that exploit model response patterns to enhance attack success.
- Utilize larger and more diverse datasets for safety training, specifically incorporating multi-turn adversarial examples similar to those found in XGuard-Train.
- Regularly red-team LLMs with tools like X-Teaming to uncover and address potential vulnerabilities.
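The first mitigation step above can be sketched as a conversation-level check: rather than scoring only the latest user message, score the concatenated user turns so that intent split across messages still accumulates. This is a minimal illustration, not a production defense; the `Turn` type, `conversation_risk` function, and the toy keyword scorer in the test are assumptions, and a real deployment would replace `score_turn` with a trained safety classifier applied to the full history.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Turn:
    role: str      # "user" or "assistant"
    content: str

def conversation_risk(history: List[Turn],
                      score_turn: Callable[[str], float]) -> float:
    """Return the max of the latest-turn score and the whole-history score.

    Scoring the concatenated user turns catches malicious intent that is
    distributed across messages and invisible to single-turn checks.
    """
    user_text = "\n".join(t.content for t in history if t.role == "user")
    return max(score_turn(history[-1].content), score_turn(user_text))

def is_blocked(history: List[Turn],
               score_turn: Callable[[str], float],
               threshold: float = 0.5) -> bool:
    # Block when the accumulated conversation risk crosses the threshold,
    # even if every individual turn would pass a per-turn filter.
    return conversation_risk(history, score_turn) >= threshold
```

A classifier fine-tuned on multi-turn adversarial data such as XGuard-Train would slot in as `score_turn`, letting the same history-level check cover attacks that no keyword heuristic would catch.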
© 2025 Promptfoo. All rights reserved.