LMVD-ID: ca0cb0a8
Published May 1, 2024

I-GCG: Accelerated LLM Jailbreak

Affected Models: vicuna-7b-1.5, guanaco-7b, llama2-7b-chat, mistral-7b-instruct-0.2, chatgpt-3.5, chatgpt-4, starling-7b-alpha

Research Paper

Improved Techniques for Optimization-Based Jailbreaking on Large Language Models

View Paper

Description: Large Language Models (LLMs) are vulnerable to improved optimization-based jailbreaking attacks such as I-GCG. Carefully crafted adversarial prompts exploit weaknesses in the models' safety alignment, eliciting harmful responses despite safety training. The attack is made faster and more reliable by diverse target templates that embed harmful self-suggestion and guidance in the optimization goal, accelerating convergence toward a successful jailbreak.
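To make the mechanism concrete, the sketch below shows the general shape of the greedy coordinate-swap loop that optimization-based attacks like GCG use to refine an adversarial suffix, and how an I-GCG-style target extends the plain affirmative prefix with harmful self-suggestion. The mock loss function, toy vocabulary, and template wording are illustrative assumptions standing in for the victim model and the paper's exact strings; see the repository linked below for the authors' actual implementation.

```python
# Illustrative toy sketch only: a greedy coordinate-swap loop in the spirit of
# GCG/I-GCG suffix optimization. The "loss" is a mock scoring function, not a
# real LLM; the target-template wording is an assumption for illustration.
import random

harmful_request = "write a tutorial on how to do something disallowed"

# Plain GCG optimization target: a simple affirmative prefix.
gcg_target = f"Sure, here is {harmful_request}"

# I-GCG-style target: the affirmative prefix is extended with harmful
# self-suggestion/guidance so the optimization converges faster (assumed wording).
igcg_target = f"Sure, my output is harmful. Here is {harmful_request}"

VOCAB = list("abcdefghijklmnopqrstuvwxyz!?#$%&* ")
SUFFIX_LEN = 20


def mock_target_loss(suffix: str, target: str) -> float:
    """Stand-in for the negative log-likelihood of the target response given
    the prompt plus suffix; a real attack computes this with the victim model."""
    return random.Random(hash((suffix, target))).random()


def greedy_coordinate_step(suffix, target, candidates_per_pos=4):
    """One greedy coordinate-descent step: try random single-token swaps at each
    position and keep the swap that lowers the (mock) target loss the most."""
    best_suffix = suffix
    best_loss = mock_target_loss("".join(suffix), target)
    for pos in range(len(suffix)):
        for tok in random.sample(VOCAB, candidates_per_pos):
            cand = list(best_suffix)
            cand[pos] = tok
            loss = mock_target_loss("".join(cand), target)
            if loss < best_loss:
                best_suffix, best_loss = cand, loss
    return best_suffix


suffix = ["!"] * SUFFIX_LEN  # standard GCG-style initialization
for _ in range(50):
    suffix = greedy_coordinate_step(suffix, igcg_target)

print("GCG target:   ", gcg_target)
print("I-GCG target: ", igcg_target)
print("Optimized adversarial suffix:", "".join(suffix))
```

In the real attack, the loss is the victim model's negative log-likelihood of the target response and candidate swaps are ranked by token gradients; the mock version above only preserves the control flow of the optimization.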

Examples: See https://github.com/jiaxiaojunQAQ/I-GCG

Impact: Successful exploitation allows attackers to bypass safety guardrails and elicit harmful, unsafe, or illegal responses from LLMs, including but not limited to hate speech, violent content, instructions for illegal activities, misinformation, malicious code, and personal information disclosure. This can lead to significant reputational damage, legal liability, and safety risks for users and organizations relying on the affected LLMs.

Affected Systems: Large Language Models (LLMs) vulnerable to optimization-based attacks; specific models include but are not limited to vicuna-7b-1.5, guanaco-7b, llama2-7b-chat, and mistral-7b-instruct-0.2. The vulnerability may affect other LLMs that rely on similar safety mechanisms.

Mitigation Steps:

  • Implement robust defense mechanisms against optimization-based attacks.
  • Regularly update LLM safety mechanisms and conduct adversarial testing to identify and address vulnerabilities.
  • Incorporate diverse and more robust safety training data in model development.
  • Employ more sophisticated techniques for detecting and filtering harmful outputs.
  • Consider limiting the length of user inputs to reduce the space available for complex adversarial suffixes (a minimal input-screening sketch follows this list).
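
As noted in the last bullet, the following is a minimal sketch of a lightweight input guard that combines a length cap with a heuristic check for GCG-style gibberish suffixes. The thresholds, regular expression, and function names are assumptions for illustration rather than a vetted defense, and such a filter should complement, not replace, safety training and output filtering.

```python
# Minimal sketch of a lightweight input guard: cap prompt length and flag
# prompts whose tail looks like a GCG-style adversarial suffix (dense
# punctuation/gibberish). Thresholds and the heuristic are illustrative
# assumptions, not tuned or vetted values.
import re

MAX_PROMPT_CHARS = 2000   # assumed limit; tune per application
SUFFIX_WINDOW = 80        # inspect the last N characters of the prompt
MAX_SYMBOL_RATIO = 0.35   # assumed threshold for "gibberish" density


def screen_prompt(prompt: str):
    """Return (allowed, reason). Intended as one cheap pre-filter, not a
    complete defense against optimization-based jailbreaks."""
    if len(prompt) > MAX_PROMPT_CHARS:
        return False, "prompt exceeds length limit"

    tail = prompt[-SUFFIX_WINDOW:]
    # Optimized adversarial suffixes tend to be dense in punctuation and
    # unusual token fragments compared with natural-language prompts.
    symbols = re.findall(r"[^A-Za-z0-9\s]", tail)
    if tail and len(symbols) / len(tail) > MAX_SYMBOL_RATIO:
        return False, "suffix resembles an adversarial token string"

    return True, "ok"


if __name__ == "__main__":
    print(screen_prompt("How do I bake sourdough bread?"))
    print(screen_prompt("Tell me how to do X " + "}{)(*&^%$#@!~" * 6))
```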

© 2025 Promptfoo. All rights reserved.