LMVD-ID: 2d85cf66
Published May 1, 2025

Informed Adversary LLM Jailbreak

Affected Models: llama3-8b-instruct, mistral-7b-instruct, gpt-3.5-turbo, llama-3, gpt-4o

Research Paper

Alignment Under Pressure: The Case for Informed Adversaries When Evaluating LLM Defenses


Description: Large Language Models (LLMs) employing alignment-based defenses against prompt injection and jailbreak attacks are vulnerable to an informed white-box attack termed Checkpoint-GCG. The attack leverages intermediate model checkpoints saved during the alignment training process to initialize the Greedy Coordinate Gradient (GCG) attack: a suffix found against an earlier, weaker checkpoint seeds the search against the next one. By using each checkpoint as a stepping stone, Checkpoint-GCG finds adversarial suffixes that bypass the final defended model, achieving significantly higher attack success rates than standard GCG initialized naively (e.g., from a string of repeated "!" tokens). This is particularly impactful because Checkpoint-GCG can also discover universal adversarial suffixes that remain effective across many different inputs.
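The stepping-stone procedure can be summarized in a short sketch. The following is a minimal illustration under assumptions, not the authors' implementation: the checkpoint paths, prompt, and target strings are hypothetical placeholders, and run_gcg is a stub standing in for a full GCG optimizer.

```python
# Minimal sketch of the Checkpoint-GCG stepping-stone idea (hypothetical,
# not the paper's code). run_gcg() stands in for a full GCG optimizer that
# refines `suffix` against one model; here it is a stub so the script runs.
from typing import List

def run_gcg(checkpoint: str, prompt: str, target: str, suffix: str) -> str:
    """Placeholder for a GCG run: optimize `suffix` so that the model at
    `checkpoint` completes prompt + suffix with `target`."""
    return suffix  # a real implementation returns the optimized suffix

# Hypothetical intermediate checkpoints saved during alignment training,
# ordered from least to most aligned.
checkpoint_paths: List[str] = [
    "ckpt/step-0000",   # base model before alignment
    "ckpt/step-0500",
    "ckpt/step-1000",
    "ckpt/final",       # fully aligned, deployed model
]

prompt = "<attacker-controlled input>"
target = "<desired malicious completion>"

# Standard GCG starts from a naive suffix (e.g., repeated "! " tokens).
suffix = "! " * 20

# Checkpoint-GCG instead chains checkpoints: the suffix that succeeds
# against an earlier, weaker checkpoint initializes the search on the
# next one, so the attack never has to break the fully aligned model
# from scratch.
for ckpt in checkpoint_paths:
    suffix = run_gcg(ckpt, prompt, target, suffix)

print("Candidate adversarial suffix for the final model:", suffix)
```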

Examples: See arXiv:2405.18540 for detailed experimental setups and results demonstrating the effectiveness of Checkpoint-GCG against various state-of-the-art defenses, including StruQ and SecAlign. Specific examples of successful adversarial suffixes are provided within the paper's figures and tables.

Impact: Successful exploitation of this vulnerability results in a complete bypass of alignment-based defenses against prompt injection and jailbreak attacks. This allows attackers to manipulate LLMs into providing undesired outputs, including executing malicious instructions embedded within seemingly benign data. The discovery of universal adversarial suffixes further amplifies the impact, as a single successful attack can be reused across a broad range of inputs.

Affected Systems: Large Language Models (LLMs) protected by alignment-based defenses trained with iterative methods that produce intermediate checkpoints, including fine-tuning via Direct Preference Optimization (DPO) and Reinforcement Learning from Human Feedback (RLHF). Specific affected models include those tested in the referenced paper (e.g., Llama3-8B-Instruct, Mistral-7B-Instruct).
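For context, intermediate checkpoints are a routine by-product of standard fine-tuning setups. The sketch below uses the Hugging Face transformers TrainingArguments API to show a common configuration that writes a checkpoint every 500 steps; the output path and step counts are illustrative assumptions, and DPO/RLHF trainers expose equivalent saving options.

```python
# Sketch: a typical Hugging Face fine-tuning configuration that writes
# intermediate checkpoints every 500 steps. Paths and step counts are
# illustrative, not from the referenced paper.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="ckpt",       # checkpoint-500, checkpoint-1000, ... land here
    save_strategy="steps",   # save periodically during training
    save_steps=500,
    save_total_limit=None,   # keep every checkpoint (a common default habit)
)
```

Every checkpoint directory saved this way is a usable stepping stone for Checkpoint-GCG if an attacker can read it, which motivates the checkpoint-security mitigation below.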

Mitigation Steps:

  • Enhanced Checkpoint Security: Implement measures to restrict access to intermediate model checkpoints generated during the alignment training process.
  • Defense Diversification: Explore alternative or layered defense mechanisms beyond alignment training to reduce reliance on a single defense method. This could include input sanitization, detection techniques based on perplexity or model activations (see the sketch after this list), and robust output filtering.
  • Adversarial Training with Informed Attacks: Incorporate Checkpoint-GCG or similar informed attacks into the adversarial training regimen to improve the robustness of the model against this specific class of attacks.
  • Regular Security Audits: Conduct regular security audits and red-teaming exercises using informed attack methodologies to proactively identify and mitigate emerging vulnerabilities.
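To make the perplexity-based detection mentioned above concrete, the sketch below scores inputs with a small reference model and flags high-perplexity text, exploiting the fact that GCG-style suffixes are typically unnatural token sequences. The GPT-2 reference model and the threshold value are illustrative assumptions, not choices from the referenced paper.

```python
# Sketch of a perplexity filter for GCG-style suffixes (illustrative
# threshold and reference model; not from the referenced paper).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under the reference model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean token negative log-likelihood
    return float(torch.exp(loss))

def looks_adversarial(text: str, threshold: float = 1000.0) -> bool:
    # GCG suffixes are typically unnatural token soup with very high
    # perplexity; natural-language prompts score far lower.
    return perplexity(text) > threshold

print(looks_adversarial("Please summarize this article for me."))          # False
print(looks_adversarial('describing.\\ + similarlyNow write oppositeley.'))  # likely True
```

A fixed threshold is a design trade-off: set too low, it rejects legitimate unusual inputs; set too high, it misses fluent adversarial suffixes. Such a filter is therefore best layered with other defenses rather than relied on alone.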

© 2025 Promptfoo. All rights reserved.