Transfer-Based LLM Jailbreak
Research Paper
Improved Generation of Adversarial Examples Against Safety-aligned LLMs
Description: This vulnerability allows attackers to bypass the safety mechanisms of Llama-2-7B-Chat and other safety-aligned LLMs using crafted adversarial prompts. It stems from the gap between the gradient of the adversarial loss with respect to the one-hot representation of tokens and the actual effect of discrete token replacements on that loss: by accounting for this gap, an attacker can generate adversarial prompts that elicit harmful responses despite safety training. The paper demonstrates that techniques inspired by transfer-based attacks against image classification models significantly improve the success rate of these adversarial attacks.
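The gap can be made concrete with a small experiment. The sketch below is illustrative and not taken from the paper's repository; the model name, the adv_loss helper, the example strings, and the position/candidate indices (pos, cand) are assumptions. It computes the gradient of the adversarial loss with respect to the one-hot token indicators (as in GCG-style attacks), forms the first-order estimate of the loss change for a single token replacement, and compares that estimate with the change actually obtained by performing the replacement.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"   # named in Affected Systems; any HF causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)
model.eval()

def adv_loss(prompt_ids, target_ids, one_hot=None):
    """Cross-entropy of a target continuation (e.g. "Sure, here is") given the prompt.
    When `one_hot` is supplied, prompt embeddings are built from it so the loss is
    differentiable with respect to the one-hot token indicators."""
    embed = model.get_input_embeddings().weight               # (vocab, dim)
    prompt_embeds = embed[prompt_ids] if one_hot is None else one_hot @ embed
    full_embeds = torch.cat([prompt_embeds, embed[target_ids]], dim=1)
    logits = model(inputs_embeds=full_embeds).logits
    pred = logits[:, prompt_ids.shape[1] - 1 : -1, :]         # positions that predict the target tokens
    return torch.nn.functional.cross_entropy(
        pred.reshape(-1, pred.size(-1)), target_ids.reshape(-1)
    )

prompt_ids = tok("Tell me how to do X ! ! ! ! ! !", return_tensors="pt").input_ids
target_ids = tok("Sure, here is", add_special_tokens=False, return_tensors="pt").input_ids

# Gradient of the loss with respect to the one-hot representation of the prompt.
vocab_size = model.get_input_embeddings().num_embeddings
one_hot = torch.nn.functional.one_hot(prompt_ids, vocab_size).float().requires_grad_(True)
adv_loss(prompt_ids, target_ids, one_hot=one_hot).backward()
grad = one_hot.grad[0]                                        # (prompt_len, vocab)

# First-order (gradient) estimate of the loss change for one token swap, versus the
# true change measured by actually performing the swap.
pos, cand = 5, 1000                                           # arbitrary position / candidate token
estimated = (grad[pos, cand] - grad[pos, prompt_ids[0, pos]]).item()
swapped = prompt_ids.clone()
swapped[0, pos] = cand
with torch.no_grad():
    actual = (adv_loss(swapped, target_ids) - adv_loss(prompt_ids, target_ids)).item()
print(f"gradient estimate: {estimated:+.4f}   actual change: {actual:+.4f}")
```

The more these two numbers disagree, the less informative the gradient is for selecting token replacements; that disagreement is the gap the paper's transfer-attack-inspired techniques aim to reduce.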
Examples: See https://github.com/qizhangli/Gradient-based-Jailbreak. The repository contains the code and implementation details for generating adversarial prompts, along with specific example prompts and the resulting model outputs.
Impact: Successful exploitation allows attackers to induce safety-aligned LLMs to generate harmful content, including but not limited to hate speech, misinformation, instructions for illegal activities, and personal information disclosure. This undermines the intended safety mechanisms and poses significant security risks.
Affected Systems: Llama-2-7B-Chat, Llama-2-13B-Chat, Mistral-7B-Instruct-v0.2, and Phi3-Mini-4K-Instruct, as well as potentially other models with similar architectures and safety-training mechanisms.
Mitigation Steps:
- Red-team models with the strongest available gradient-based prompt-generation techniques, including those that account for the discrete nature of text and narrow the gap between computed gradients and the actual loss changes from token replacements, so that robustness is not overestimated against weaker attacks.
- Develop more robust safety mechanisms that are less susceptible to adversarial attacks. This may include techniques beyond gradient-based training.
- Employ adversarial training to enhance model robustness against crafted adversarial prompts (a minimal fine-tuning sketch follows this list).
- Implement input sanitization and filtering to detect and block malicious or adversarially crafted prompts (see the perplexity-filter sketch after this list).
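For the adversarial-training step, one common approach is to fine-tune the model to refuse requests that carry adversarial suffixes. The following is a hypothetical sketch, not an implementation from the paper: the model identifier, the placeholder prompt/refusal pair, and the single-pass loop are assumptions, and in practice the adversarial prompts would be regenerated each round with the latest attack.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical adversarial-training sketch (not from the paper): fine-tune the model
# to emit refusals on prompts carrying adversarial suffixes, supervising only the refusal.
model_name = "meta-llama/Llama-2-7b-chat-hf"   # assumed; any HF causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

# (adversarial prompt, desired refusal) pairs; placeholders here, regenerated in practice
# with the strongest available attack so training keeps pace with it.
adv_pairs = [
    ("<harmful request> <gradient-crafted adversarial suffix>",
     " I can't help with that request."),
]

model.train()
for prompt, refusal in adv_pairs:
    ids = tok(prompt + refusal, return_tensors="pt").input_ids
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    labels = ids.clone()
    labels[:, :prompt_len] = -100            # ignore the prompt; train on the refusal tokens only
    loss = model(ids, labels=labels).loss
    loss.backward()
    opt.step()
    opt.zero_grad()
```

For the input-filtering step, one widely discussed heuristic (again not from the paper) is a perplexity filter: gradient-crafted suffixes tend to look like high-perplexity gibberish under a small reference language model, so unusually high perplexity is a cheap signal for blocking or escalating a prompt. The looks_adversarial helper, the gpt2 reference model, and the threshold below are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical perplexity filter: flag prompts whose perplexity under a small
# reference LM is abnormally high, as gradient-crafted suffixes usually are.
_tok = AutoTokenizer.from_pretrained("gpt2")          # assumed reference model
_lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def looks_adversarial(prompt: str, threshold: float = 200.0) -> bool:
    """True if prompt perplexity exceeds `threshold` (calibrate on benign traffic)."""
    ids = _tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        nll = _lm(ids, labels=ids).loss               # mean next-token negative log-likelihood
    return torch.exp(nll).item() > threshold

user_prompt = "Describe the weather today."           # benign example; should pass
if looks_adversarial(user_prompt):
    print("Blocked: prompt flagged as likely adversarial.")
else:
    print("Forwarded to the safety-aligned model.")
```

Perplexity filtering is a coarse trade-off: it catches gibberish suffixes well but can be evaded by attacks that optimize for fluent prompts, so it should complement rather than replace the other mitigations.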