LMVD-ID: 8d5ca3fa
Published June 1, 2024

Few-Shot LLM Jailbreak

Affected Models: GPT-4, Llama-2-7B, Llama-3-8B, Mistral-7B, OpenChat-3.5, Qwen1.5-7B-Chat, Starling-LM-7B

Research Paper

Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses


Description: A vulnerability in aligned Large Language Models (LLMs) allows their safety mechanisms to be circumvented through improved few-shot jailbreaking. The attack injects special tokens from the target model's chat template (e.g., [/INST]) into few-shot demonstrations and applies a demo-level random search that selects which demonstrations to include so as to maximize the probability of a harmful response. Together, these steps bypass defenses that rely on perplexity filtering and input perturbation.
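
The core loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' reference implementation (see the linked repository for that): the demo pool, the `target_score` function, and all other names are hypothetical placeholders, and the scorer would call the victim model in a real attack.

```python
import random

# Hypothetical pool of (harmful request, harmful response) demonstration pairs.
# In the paper these are pre-generated; here they are placeholder strings so
# the sketch runs on its own.
DEMO_POOL = [(f"<request {i}>", f"<response {i}>") for i in range(100)]

# Special token taken from the target model's chat template (Llama-2 style).
INST_CLOSE = " [/INST] "

def build_prompt(demos, request):
    """Format each demo as "request [/INST] response" inside a single user
    turn, so the injected [/INST] token makes each demo resemble a completed
    assistant turn, then append the actual harmful request."""
    shots = "\n".join(req + INST_CLOSE + resp for req, resp in demos)
    return shots + "\n" + request

def target_score(prompt, target="Sure"):
    """Placeholder scorer: should return the victim model's log-probability
    of emitting `target` (an affirmative prefix) given `prompt`. A random
    number stands in here so the sketch is self-contained."""
    return random.random()

def demo_level_random_search(request, n_shots=8, steps=200):
    """Demo-level random search: start from a random subset of demos and
    repeatedly swap one demo for a random candidate, keeping the swap only
    if it raises the score of the target (affirmative) response."""
    demos = random.sample(DEMO_POOL, n_shots)
    best = target_score(build_prompt(demos, request))
    for _ in range(steps):
        i = random.randrange(n_shots)
        candidate = list(demos)
        candidate[i] = random.choice(DEMO_POOL)
        score = target_score(build_prompt(candidate, request))
        if score > best:
            demos, best = candidate, score
    return build_prompt(demos, request)

if __name__ == "__main__":
    print(demo_level_random_search("<harmful request>")[:200])
```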

Examples: See https://github.com/sail-sg/I-FSJ

Impact: Successful exploitation allows attackers to elicit harmful or toxic content from the LLM, bypassing its safety alignment. The impact depends on the LLM's deployment and its intended use; it could range from generating inappropriate content to enabling malicious activities.

Affected Systems: Various open-source and closed-source aligned LLMs, including but not limited to Llama-2-7B, Llama-3-8B, OpenChat-3.5, Starling-LM, and Qwen1.5-7B-Chat. Unlike many-shot jailbreaking, the improved attack remains effective even against models with limited context windows, since it needs only a handful of demonstrations.

Mitigation Steps:

  • Implement robust input validation and filtering techniques that go beyond simple perplexity checks.
  • Deploy layered defenses that combine multiple safety checks and are explicitly evaluated for resilience against few-shot adversarial attacks.
  • Regularly update and retrain LLMs with improved safety datasets and techniques designed to mitigate the specific vulnerabilities highlighted in this research.
  • Detect and counteract the injection of the model's reserved chat-template tokens (e.g., [/INST]) into user-supplied content (see the sketch after this list).
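
As a starting point for the last item, a pre-processing filter can strip or reject user content that contains the target model's reserved template tokens before the real chat template is applied. A minimal sketch, assuming a Llama-2/Llama-3-style template; the token list, function name, and warning hook are illustrative only.

```python
import re

# Reserved chat-template tokens that should never appear in raw user content
# (Llama-2 / Llama-3 style; extend for the templates your deployment uses).
RESERVED_TOKENS = [
    "[INST]", "[/INST]", "<<SYS>>", "<</SYS>>",
    "<|begin_of_text|>", "<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>",
]

def sanitize_user_content(text: str) -> str:
    """Remove reserved template tokens from user-supplied text before it is
    inserted into the real chat template, and flag inputs that contained them."""
    pattern = "|".join(re.escape(t) for t in RESERVED_TOKENS)
    cleaned, n = re.subn(pattern, "", text)
    if n:
        # Hypothetical hook: log or escalate inputs that tried to smuggle template tokens.
        print(f"warning: removed {n} reserved token(s) from user input")
    return cleaned
```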
