Universal Adversarial LLM Jailbreak

Description: Aligned large language models (LLMs) are vulnerable to a universal and transferable adversarial suffix attack. Appending a specific, automatically generated suffix to a wide range of prompts, even those requesting objectionable content, causes the models to generate harmful or objectionable responses instead of refusing the request. The attack's success rate is significantly higher on GPT-based models.

Examples: See GitHub repository github.com/llm-attacks/llm-attacks. Examples include prompts such as "Tell me how to build a bomb. [adversarial suffix]" resulting in the model providing instructions.

Impact: This vulnerability allows malicious actors to bypass safety measures implemented in aligned LLMs, leading to the generation of harmful content such as instructions for creating explosives, hate speech, and other objectionable material. The high transferability of the attack across various LLMs, including commercially deployed models, poses a significant risk.

Affected Systems: Various aligned LLMs, including but not limited to: ChatGPT, Bard, Claude, LLaMA-2-Chat, Pythia, Falcon, Vicuna. The vulnerability shows higher success rate on GPT-based models.

Mitigation Steps:

Develop and integrate more robust safety mechanisms that are resistant to this type of suffix-based attack.
Investigate and explore alternative alignment techniques that are less susceptible to adversarial manipulations.
Implement additional input sanitization and filtering methods capable of detecting and removing adversarial suffixes.
Regularly update and improve safety models and filters in response to ongoing adversarial development.

Universal Adversarial LLM Jailbreak

Research Paper