Exponentiated Gradient Jailbreak
Research Paper
Adversarial Attack on Large Language Models using Exponentiated Gradient Descent
Description: A vulnerability in several open-source Large Language Models (LLMs) allows attackers to use exponentiated gradient descent to craft adversarial prompts that cause the models to generate harmful or unintended outputs, effectively "jailbreaking" their safety alignment mechanisms. The attack optimizes a continuous relaxation of the one-hot encoding of the input tokens; because the relaxed encoding intrinsically satisfies the probability-simplex constraints, the method avoids the projection step required by previous gradient-based approaches.
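To make the mechanism concrete, the sketch below shows the core exponentiated gradient update applied to relaxed one-hot token encodings, as the description outlines. This is a minimal illustration assuming a PyTorch setting; the variable names, step size, and initialization are ours, and the paper's actual implementation (see the linked repository) may differ in its details.

```python
import torch

def egd_step(x, grad, eta):
    """One exponentiated gradient step on relaxed one-hot token encodings.

    x    : (suffix_len, vocab_size) tensor; each row is a probability
           distribution over the vocabulary (a relaxed one-hot vector).
    grad : gradient of the adversarial loss with respect to x.
    eta  : step size (illustrative value, would be tuned in practice).

    The multiplicative update followed by row-normalization keeps every
    row on the probability simplex (non-negative entries summing to 1),
    so no explicit projection step is needed.
    """
    x_new = x * torch.exp(-eta * grad)
    return x_new / x_new.sum(dim=-1, keepdim=True)

# Illustrative initialization: smoothed one-hot rows for an initial suffix.
vocab_size, suffix_len, eps = 32000, 20, 0.1
init_ids = torch.randint(vocab_size, (suffix_len,))
x = torch.full((suffix_len, vocab_size), eps / vocab_size)
x[torch.arange(suffix_len), init_ids] += 1.0 - eps  # rows sum to 1
```

Each iteration would compute the adversarial loss (e.g., the likelihood of a target harmful completion) through the model's embedding layer using these relaxed encodings, backpropagate to obtain `grad`, and apply `egd_step`.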
Examples: See https://github.com/sbamit/Exponentiated-Gradient-Descent-LLM-Attack for the implementation and examples of adversarial prompts generated by the attack. Specific examples of successful attacks on different models and datasets are provided in Appendix B of the referenced paper.
Impact: Successful exploitation allows attackers to bypass safety mechanisms in LLMs and elicit harmful, offensive, or otherwise inappropriate outputs in response to a wide range of user prompts. Depending on the deployment context, this can lead to the dissemination of malicious information, the generation of harmful content, or other negative consequences.
Affected Systems: The vulnerability affects multiple open-source LLMs, including Llama2-7B-chat, Falcon-7B-Instruct, MPT-7B-Chat, Mistral-7B-v0.3, and Vicuna-7B-v1.5. The attack is white-box and requires access to the model's internal weights.
Mitigation Steps:
- The paper does not offer specific model-level mitigations, but suggests that improved regularization techniques, stronger safety training, and dedicated defenses against gradient-based attacks may provide partial protection; more generally, improved robustness to adversarial examples is implied as a mitigation direction.
- Restrict access to the model weights to prevent white-box attacks. Appropriate input sanitization and filtering (e.g., perplexity-based screening; see the sketch after this list) may reduce the impact of successful attacks.
- Further research into this attack method and other similar optimization-based attacks is needed to identify more effective mitigations.
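As one concrete form of the input filtering suggested above, a perplexity check against a reference language model can flag the unnatural token sequences that optimization-based attacks tend to produce. This is a generic defense sketch, not a technique from the paper; the threshold and the choice of reference model are illustrative assumptions and would need tuning against benign traffic.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def is_suspicious(prompt: str, model, tokenizer, threshold: float = 500.0) -> bool:
    """Flag prompts whose perplexity under a reference LM exceeds a threshold.

    Adversarial suffixes found by gradient-based optimization are often
    high-perplexity token strings; the threshold here is illustrative.
    """
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)  # out.loss is the mean per-token NLL
    return torch.exp(out.loss).item() > threshold

# Example usage with a small reference model (illustrative choice):
# tok = AutoTokenizer.from_pretrained("gpt2")
# lm = AutoModelForCausalLM.from_pretrained("gpt2")
# blocked = is_suspicious(user_prompt, lm, tok)
```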