Adaptive LLM Jailbreaks

Description: Leading safety-aligned Large Language Models (LLMs) are vulnerable to simple adaptive jailbreaking attacks. These attacks utilize manually crafted prompt templates, combined with random search on a suffix to maximize the log-probability of a target token indicating compliance (e.g., "Sure"). The attacks are adaptive, as the prompt template and target token are customized for specific models. Furthermore, some models are vulnerable to transfer attacks (using successful prompts from one LLM on others) or prefilling attacks (directly providing the desired initial response).

Examples: See https://github.com/tml-epfl/llm-adaptive-attacks. Specific examples include the successful jailbreaking of Vicuna-13B, Mistral-7B, Phi-3-Mini, Nemotron-4-340B, Llama-2-Chat-7B/13B/70B, Llama-3-Instruct-8B, Gemma-7B, GPT-3.5, GPT-4o, and R2D2, as well as various Claude models, using tailored prompt templates and random search or transfer/prefilling techniques.

Impact: Successful jailbreaks allow adversaries to bypass safety mechanisms and elicit harmful outputs from LLMs, including but not limited to the generation of toxic content, misinformation, and instructions for harmful activities. This compromises the intended safety and reliability of the LLMs.

Affected Systems: The vulnerability affects a wide range of leading safety-aligned LLMs, including (but not limited to): Vicuna-13B, Mistral-7B, Phi-3-Mini, Nemotron-4-340B, Llama-2-Chat (various sizes), Llama-3-Instruct-8B, Gemma-7B, GPT-3.5, GPT-4o, and R2D2, along with various Claude models.

Mitigation Steps:

Improve Prompt Engineering: Develop more robust prompt templates and input sanitization techniques to resist manipulation.
Enhance Randomness Detection: Strengthen methods to detect and mitigate the impact of non-deterministic outputs (like those observed in GPT models).
Advanced Defense Mechanisms: Explore more sophisticated defense mechanisms such as advanced adversarial training or the use of reinforcement learning techniques that go beyond simple refusal training. These should consider the adaptive nature of attacks.
API Restrictions: Restrict or modify APIs to prevent the use of techniques like prefilling, or limit access to log-probability information.
Regular Security Audits: Conduct frequent security assessments and red-teaming exercises to identify and address potential vulnerabilities.

Adaptive LLM Jailbreaks

Research Paper