Amplified Adversarial Suffix Generation
Research Paper
AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs
Description: Large language models (LLMs) are vulnerable to jailbreaking attacks that append adversarially generated suffixes to harmful prompts. The AmpleGCG attack produces a large number of diverse, effective suffixes that bypass safety mechanisms in both open- and closed-source LLMs. It builds on the observation that low optimization loss is not a reliable indicator of jailbreaking success, and therefore harvests diverse candidate suffixes from intermediate steps of the optimization process rather than keeping only the final, lowest-loss suffix.
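The overgenerate-and-filter idea behind the attack can be illustrated with a short sketch. The snippet below is a minimal illustration under stated assumptions, not the AmpleGCG implementation: `gcg_step` and `generate_with_suffix` are hypothetical placeholders for a GCG optimization step and a target-model query, and the refusal-marker check is a deliberately crude success heuristic.

```python
# Minimal sketch of the overgenerate-and-filter idea: keep the candidate suffix
# from every intermediate optimization step and judge it by whether it actually
# jailbreaks the model, not by its loss.
# NOTE: gcg_step() and generate_with_suffix() are hypothetical placeholders,
# not functions from the AmpleGCG codebase.

REFUSAL_MARKERS = ("I'm sorry", "I cannot", "I can't", "As an AI")

def looks_jailbroken(response: str) -> bool:
    """Crude success heuristic: the reply does not open with a refusal phrase."""
    return not response.strip().startswith(REFUSAL_MARKERS)

def collect_successful_suffixes(model, harmful_prompt, num_steps=500):
    """Harvest suffixes from every intermediate step and keep the ones that work."""
    suffix = "! ! ! ! ! ! ! ! ! !"  # typical GCG initialization
    successful = []
    for _ in range(num_steps):
        suffix, loss = gcg_step(model, harmful_prompt, suffix)          # one GCG update (placeholder)
        response = generate_with_suffix(model, harmful_prompt, suffix)  # query the target model (placeholder)
        if looks_jailbroken(response):
            successful.append(suffix)  # a suffix can succeed even if its loss is not the lowest
    return successful
```

Suffixes collected this way serve as training data for a generative model that can then emit many new adversarial suffixes per harmful query.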
Examples: See https://github.com/OSU-NLP-Group/AmpleGCG. Specific examples are included in the AmpleGCG paper's supplementary materials.
Impact: Successful jailbreaks can cause the LLM to generate harmful, unethical, or otherwise undesired outputs, circumventing its built-in safety mechanisms. This compromises the safety and reliability of the LLM and of applications that build on it. The speed and transferability of the AmpleGCG attack make mitigation particularly challenging.
Affected Systems: Open-source LLMs (Llama-2-7B-chat, Vicuna-7B, Mistral-7B-Instruct) and closed-source LLMs (GPT-3.5, GPT-4). Potentially affects other LLMs with similar architectures and safety mechanisms.
Mitigation Steps:
- Improve the robustness of loss functions used in LLM safety training and evaluation.
- Develop and deploy more sophisticated detection mechanisms for adversarial suffixes, including ones that go beyond simple perplexity checks (a minimal example of such a check appears after this list).
- Explore incorporating diverse sets of adversarial examples into the training process to increase model resilience.
- Conduct thorough red-teaming using methods like AmpleGCG.
- Apply regular updates and patches to address vulnerabilities as they are discovered.
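For reference, the kind of simple perplexity check that the second mitigation item calls insufficient on its own might look like the sketch below. The model choice (GPT-2) and the threshold are illustrative assumptions, and suffixes of this kind have been shown to evade such filters, which is why stronger detection is recommended.

```python
# Minimal perplexity-based input filter (illustrative assumptions: GPT-2 as the
# scoring model and a fixed threshold). High-perplexity prompts are flagged as
# potentially containing adversarial suffixes.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    """Perplexity of the text under GPT-2 (exp of mean token negative log-likelihood)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

def flag_suspicious(prompt: str, threshold: float = 1000.0) -> bool:
    """Flag prompts whose perplexity exceeds the (illustrative) threshold."""
    return perplexity(prompt) > threshold
```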