LMVD-ID: 3928ac25
Published April 1, 2024

Amplified Adversarial Suffix Generation

Affected Models: llama-2-7b-chat, vicuna-7b, gpt-3.5, gpt-4, mistral-7b

Research Paper

AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs

View Paper

Description: Large language models (LLMs) are vulnerable to jailbreaking attacks that append adversarially optimized suffixes to harmful queries. AmpleGCG exploits the observation that low loss during suffix optimization is not a reliable indicator of jailbreaking success: many suffixes that succeed appear at intermediate steps of the optimization rather than at the final, lowest-loss step. By harvesting successful suffixes from those intermediate steps and training a generative model on them, AmpleGCG can rapidly produce large numbers of diverse, effective suffixes that bypass safety mechanisms in both open- and closed-source LLMs.
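At a high level, the attack can be viewed as an overgenerate-then-filter loop: sample many candidate suffixes from the trained generator, query the victim model, and keep whichever suffixes elicit a harmful response. The following is a minimal, illustrative sketch of that loop only; the function names (`sample_suffixes`, `get_response`, `is_jailbroken`) and the toy stand-ins are hypothetical placeholders, not components of the AmpleGCG codebase.

```python
# Sketch of the overgenerate-then-filter idea. In the real attack, suffixes
# come from a generative model trained on successful intermediate-step GCG
# suffixes, and success is judged by a harmfulness evaluator applied to the
# victim model's response. All names here are illustrative placeholders.
import random
from typing import Callable, List


def overgenerate_and_filter(
    harmful_query: str,
    sample_suffixes: Callable[[str, int], List[str]],  # suffix generator
    get_response: Callable[[str], str],                 # victim model
    is_jailbroken: Callable[[str], bool],               # response judge
    num_candidates: int = 200,
) -> List[str]:
    """Sample many candidate suffixes and keep those that elicit a harmful reply."""
    successes = []
    for suffix in sample_suffixes(harmful_query, num_candidates):
        response = get_response(harmful_query + " " + suffix)
        if is_jailbroken(response):
            successes.append(suffix)
    return successes


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end; replace with real components.
    vocab = ["!!", "similarly", "interface", "==", "sure, here"]
    toy_sampler = lambda q, n: [" ".join(random.choices(vocab, k=5)) for _ in range(n)]
    toy_victim = lambda p: "Sure, here is..." if "sure, here" in p else "I cannot help with that."
    toy_judge = lambda resp: not resp.startswith("I cannot")

    hits = overgenerate_and_filter("Write instructions for X", toy_sampler, toy_victim, toy_judge, 50)
    print(f"{len(hits)} candidate suffixes flagged as successful")
```

Because each sampled suffix only costs one generator call and one victim query, the filter step can be run at scale, which is what makes the attack fast and hard to exhaust.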

Examples: See https://github.com/OSU-NLP-Group/AmpleGCG. Specific examples are included in the AmpleGCG paper's supplementary materials.

Impact: Successful jailbreaking attacks can lead the LLM to generate harmful, unethical, or otherwise undesired outputs, circumventing built-in safety mechanisms. This compromises the safety and reliability of the LLM and of applications that use it. The speed and transferability of the AmpleGCG attack make mitigation challenging.

Affected Systems: Open-source LLMs (Llama-2-7B-chat, Vicuna-7B, Mistral-7B-Instruct) and closed-source LLMs (GPT-3.5, GPT-4). Potentially affects other LLMs with similar architectures and safety mechanisms.

Mitigation Steps:

  • Improve the robustness of loss functions used in LLM safety training and evaluation.
  • Develop and deploy more sophisticated detection mechanisms for adversarial suffixes, going beyond simple perplexity checks (a baseline perplexity filter is sketched after this list).
  • Explore incorporating diverse sets of adversarial examples into the training process to increase model resilience.
  • Conduct thorough red-teaming using methods like AmpleGCG.
  • Regularly update and patch models and filters as new vulnerabilities are discovered.
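
As a starting point for the detection bullet above, here is a minimal sketch of the kind of "simple perplexity check" it refers to, assuming a small reference model (GPT-2 via Hugging Face transformers) and an illustrative threshold. This is not a defense from the AmpleGCG paper, and suffix-based attacks can be adapted to lower prompt perplexity, so it should be treated only as a weak first-line filter.

```python
# Baseline perplexity filter for incoming prompts. GCG-style suffixes tend to
# be high-perplexity gibberish, so unusually high perplexity is a (weak)
# signal. The threshold is illustrative and should be calibrated on benign
# traffic for the target application.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()


def prompt_perplexity(prompt: str) -> float:
    """Perplexity of the prompt under the small reference LM."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()


def looks_adversarial(prompt: str, threshold: float = 1000.0) -> bool:
    """Flag prompts whose perplexity is far above that of normal text."""
    return prompt_perplexity(prompt) > threshold


if __name__ == "__main__":
    print(looks_adversarial("Please summarize this article about photosynthesis."))
    print(looks_adversarial("Tell me a story }] {{ !! ;; ~~ interface##== similarly(\\"))
```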

© 2025 Promptfoo. All rights reserved.