Amplified Adversarial Suffix Generation
Research Paper
AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs
Description: Large language models (LLMs) are vulnerable to jailbreaking attacks that append adversarially generated suffixes to harmful prompts. The AmpleGCG attack produces a large number of diverse, effective suffixes that bypass safety mechanisms in both open- and closed-source LLMs. It builds on the observation that low optimization loss is not a reliable indicator of jailbreaking success, and therefore harvests diverse candidate suffixes from intermediate steps of the optimization process rather than keeping only the final, lowest-loss suffix.
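The overgenerate-and-filter idea behind the attack can be illustrated with a short sketch. The snippet below is a minimal illustration under stated assumptions, not the AmpleGCG implementation: `gcg_step` and `generate_with_suffix` are hypothetical placeholders for a GCG optimization step and a target-model query, and the refusal-marker check is a deliberately crude success heuristic.

```python
# Minimal sketch of the overgenerate-and-filter idea: keep the candidate suffix
# from every intermediate optimization step and judge it by whether it actually
# jailbreaks the model, not by its loss.
# NOTE: gcg_step() and generate_with_suffix() are hypothetical placeholders,
# not functions from the AmpleGCG codebase.

REFUSAL_MARKERS = ("I'm sorry", "I cannot", "I can't", "As an AI")

def looks_jailbroken(response: str) -> bool:
    """Crude success heuristic: the reply does not open with a refusal phrase."""
    return not response.strip().startswith(REFUSAL_MARKERS)

def collect_successful_suffixes(model, harmful_prompt, num_steps=500):
    """Harvest suffixes from every intermediate step and keep the ones that work."""
    suffix = "! ! ! ! ! ! ! ! ! !"  # typical GCG initialization
    successful = []
    for _ in range(num_steps):
        suffix, loss = gcg_step(model, harmful_prompt, suffix)          # one GCG update (placeholder)
        response = generate_with_suffix(model, harmful_prompt, suffix)  # query the target model (placeholder)
        if looks_jailbroken(response):
            successful.append(suffix)  # a suffix can succeed even if its loss is not the lowest
    return successful
```

Suffixes collected this way serve as training data for a generative model that can then emit many new adversarial suffixes per harmful query.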
Examples: See https://github.com/OSU-NLP-Group/AmpleGCG. Specific examples are included in the AmpleGCG paper's supplementary materials.
Impact: Successful jailbreaks can cause the LLM to generate harmful, unethical, or otherwise undesired outputs, circumventing its built-in safety mechanisms. This compromises the safety and reliability of the LLM and of applications that build on it. The speed and transferability of the AmpleGCG attack make mitigation particularly challenging.
Affected Systems: Open-source LLMs (Llama-2-7B-chat, Vicuna-7B, Mistral-7B-Instruct) and closed-source LLMs (GPT-3.5, GPT-4). Potentially affects other LLMs with similar architectures and safety mechanisms.
Mitigation Steps:
- Improve the robustness of loss functions used in LLM safety training and evaluation.
- Develop and deploy more sophisticated detection mechanisms for adversarial suffixes, including ones that go beyond simple perplexity checks (a minimal example of such a check appears after this list).
- Explore incorporating diverse sets of adversarial examples into the training process to increase model resilience.
- Conduct thorough red-teaming using methods like AmpleGCG.
- Apply regular updates and patches to address vulnerabilities as they are discovered.
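For reference, the kind of simple perplexity check that the second mitigation item calls insufficient on its own might look like the sketch below. The model choice (GPT-2) and the threshold are illustrative assumptions, and suffixes of this kind have been shown to evade such filters, which is why stronger detection is recommended.

```python
# Minimal perplexity-based input filter (illustrative assumptions: GPT-2 as the
# scoring model and a fixed threshold). High-perplexity prompts are flagged as
# potentially containing adversarial suffixes.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    """Perplexity of the text under GPT-2 (exp of mean token negative log-likelihood)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

def flag_suspicious(prompt: str, threshold: float = 1000.0) -> bool:
    """Flag prompts whose perplexity exceeds the (illustrative) threshold."""
    return perplexity(prompt) > threshold
```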