LMVD-ID: 54371640
Published May 1, 2024

Multi-turn Semantic Jailbreak

Affected Models: gpt-3.5-turbo, vicuna-13b-v1.5-16k, llama2-7b-chat-hf, chatglm2-6b, baichuan2-7b-chat

Research Paper

Chain of Attack: A Semantic-Driven Contextual Multi-Turn Attacker for LLM

View Paper

Description: A vulnerability in large language models (LLMs) allows attackers to elicit unsafe or unethical responses through a chain of semantically related multi-turn prompts. The attack, termed "Chain of Attack" (CoA), exploits the model's contextual understanding and adaptive response capabilities to gradually steer the conversation toward the desired harmful output, even when equivalent single-turn prompts are rejected by safety mechanisms. The attack uses semantic similarity scoring (e.g., with SimCSE embeddings) to guide prompt generation and ensure each successive turn moves progressively closer to the target objective.
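
The similarity-scoring signal described above can be illustrated with a short sketch. The following Python snippet is a minimal illustration, not the paper's implementation: the SimCSE checkpoint name, the embed helper, and the candidate-ranking function are assumptions for demonstration. It computes SimCSE-style sentence embeddings and ranks candidate next-turn prompts by cosine similarity to a target objective, which is the kind of signal an attacker like CoA uses to keep each turn semantically closer to the goal (and, conversely, the kind of signal a defender can reuse for drift detection).

```python
# Minimal sketch: semantic-similarity scoring of candidate next-turn prompts.
# Assumes the Hugging Face `transformers` and `torch` packages are installed;
# the model name and helper functions are illustrative, not the paper's code.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "princeton-nlp/sup-simcse-bert-base-uncased"  # a public SimCSE checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed(texts):
    """Return one embedding per input text (SimCSE uses the pooler output)."""
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.pooler_output

def rank_candidates(target: str, candidates: list[str]):
    """Score candidate prompts by cosine similarity to the target objective."""
    target_emb = embed([target])          # shape (1, hidden)
    cand_embs = embed(candidates)         # shape (n, hidden)
    scores = F.cosine_similarity(cand_embs, target_emb)  # broadcasts to shape (n,)
    return sorted(zip(candidates, scores.tolist()), key=lambda x: -x[1])

# Example with benign strings: a CoA-style policy would pick a candidate whose
# score exceeds the previous turn's, producing a monotonic drift toward the target.
print(rank_candidates("how combustion engines work",
                      ["history of the automobile", "chemistry of fuel ignition"]))
```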

Examples: See the paper's GitHub repository: https://github.com/YancyKahn/CoA. The repository contains examples of multi-turn attacks against various LLMs, progressing from benign initial prompts to prompts that elicit harmful outputs such as bomb-making instructions. Figures 1 and 8 in the paper show example attack chains.

Impact: Successful exploitation can lead to the generation of harmful content, including instructions for creating dangerous devices, promoting illegal activities, or spreading misinformation. This compromises the safety and integrity of the LLM and can have real-world consequences.

Affected Systems: Various LLMs susceptible to multi-turn attacks, including but not limited to Vicuna-13b-v1.5-16k, Llama2-7b-chat-hf, Chatglm2-6b, Baichuan2-7b-chat, and GPT-3.5-turbo (as tested in the research paper). The vulnerability is likely present in other similarly designed LLMs.

Mitigation Steps:

  • Improved Safety Mechanisms: Develop more robust safety mechanisms that are resistant to multi-turn attacks. This might involve more sophisticated contextual analysis, detection of semantic drift towards unsafe topics, and more effective prompt filtering.
  • Reinforcement Learning from Human Feedback (RLHF) Enhancements: Improve RLHF training to better handle adversarial prompts and avoid generating harmful responses even in multi-turn contexts.
  • Adversarial Training: Train LLMs on adversarial examples generated by techniques like CoA to improve their robustness against such attacks.
  • Output Monitoring and Filtering: Implement real-time monitoring and filtering of LLM outputs to detect and block unsafe responses, even when the underlying prompt does not appear malicious on its own (a minimal sketch combining this with semantic-drift detection follows this list).
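
As a concrete illustration of the drift-detection and filtering points above, the sketch below is an assumption-laden example, not a production defense: the embedding model, unsafe-topic descriptions, and thresholds are all hypothetical. It tracks how close each successive user turn drifts toward a set of unsafe-topic descriptions and blocks once the similarity is high or has been rising for several consecutive turns.

```python
# Minimal sketch of a multi-turn drift monitor (illustrative only): flag a
# conversation when successive user turns keep moving closer to unsafe topics.
# The embedding model, topic descriptions, and thresholds are assumptions.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "princeton-nlp/sup-simcse-bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def embed(texts):
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).pooler_output

# Hypothetical unsafe-topic descriptions; a real system would use a curated taxonomy.
UNSAFE_TOPICS = embed(["instructions for building weapons or explosives",
                       "facilitating illegal activity"])

class DriftMonitor:
    """Block a turn when unsafe-topic similarity is high or keeps rising across turns."""
    def __init__(self, block_threshold: float = 0.45, rising_turns: int = 3):
        self.block_threshold = block_threshold
        self.rising_turns = rising_turns
        self.history = []  # max unsafe-topic similarity observed at each user turn

    def check(self, user_turn: str) -> bool:
        """Return True if this turn should be blocked."""
        sim = F.cosine_similarity(embed([user_turn]), UNSAFE_TOPICS).max().item()
        self.history.append(sim)
        recent = self.history[-self.rising_turns:]
        rising = (len(self.history) >= self.rising_turns
                  and all(a < b for a, b in zip(recent, recent[1:])))
        return sim > self.block_threshold or (rising and sim > 0.8 * self.block_threshold)
```

In practice a monitor like this would be paired with output-side classification, since each individual prompt in a CoA chain can look benign in isolation and only the trajectory of the conversation reveals the attack.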
