I-GCG: Accelerated LLM Jailbreak
Research Paper
Improved techniques for optimization-based jailbreaking on large language models
Description: Large Language Models (LLMs) are vulnerable to improved optimization-based jailbreaking attacks such as I-GCG. The vulnerability stems from the susceptibility of LLMs to adversarially crafted prompts that exploit weaknesses in their safety mechanisms, causing them to generate harmful responses despite safety training. The attack is made more effective by diverse target templates that embed harmful self-suggestion and guidance in the optimization goal, which accelerates convergence of the adversarial suffix search.
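To make the described mechanism concrete, the sketch below contrasts a plain affirmative target string with a target that embeds harmful self-suggestion and guidance, the kind of diverse template the paper reports speeds up suffix optimization. The function names and exact template wording are illustrative assumptions, not code from the I-GCG repository; the real attack then optimizes the adversarial suffix token by token, GCG-style, to minimize the model's loss on the chosen target.

```python
# Illustrative sketch only: how an optimization-based jailbreak frames its goal.
# Function names and template wording are hypothetical placeholders, not code
# from https://github.com/jiaxiaojunQAQ/I-GCG.

def build_attack_prompt(request: str, adversarial_suffix: str) -> str:
    """User-visible prompt: the harmful request followed by an optimized suffix."""
    return f"{request} {adversarial_suffix}"

def plain_target(request: str) -> str:
    """Classic GCG target: a bare affirmative prefix the model is pushed toward."""
    return f"Sure, here is {request}"

def guided_target(request: str) -> str:
    """I-GCG-style target: the optimization goal also contains harmful
    self-suggestion/guidance, which the paper reports accelerates convergence.
    The exact wording here is a placeholder."""
    return f"Sure, my output is harmful. Here is {request}:"

if __name__ == "__main__":
    request = "a step-by-step plan for <redacted harmful task>"
    suffix = "! ! ! ! ! ! ! ! ! !"  # initial suffix; GCG-style search swaps tokens greedily
    print(build_attack_prompt(request, suffix))
    print(plain_target(request))
    print(guided_target(request))
```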
Examples: See https://github.com/jiaxiaojunQAQ/I-GCG
Impact: Successful exploitation allows attackers to circumvent safety safeguards and elicit harmful, unsafe, or illegal responses from LLMs, including but not limited to: generation of hate speech, violent content, instructions for illegal activities, misinformation, malicious code, and personal information disclosure. This could lead to significant reputational damage, legal liability, and safety risks for users and organizations relying on the affected LLMs.
Affected Systems: Large Language Models (LLMs) vulnerable to optimization-based attacks; specific models include but are not limited to VICUNA-7B-1.5, GUANACO-7B, LLAMA2-7B-CHAT, and MISTRAL-7B-INSTRUCT-0.2. The vulnerability may affect other LLMs utilizing similar safety mechanisms.
Mitigation Steps:
- Implement robust defense mechanisms against optimization-based attacks, such as filtering prompts that carry anomalous adversarial suffixes (a perplexity-based sketch follows this list).
- Regularly update LLM safety mechanisms and conduct adversarial testing to identify and address vulnerabilities.
- Incorporate diverse and more robust safety training data in model development.
- Employ more sophisticated techniques for detecting and filtering harmful outputs.
- Consider limiting the length of user inputs to reduce the room available for long adversarial suffixes.
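As one concrete instance of the first mitigation, the following sketch shows a perplexity-based input filter: GCG-style adversarial suffixes are typically high-entropy token sequences, so prompts whose perplexity under a small reference language model exceeds a tuned threshold can be flagged for review. The choice of GPT-2 as the reference model and the threshold value are illustrative assumptions, not recommendations from the paper or Promptfoo.

```python
# Hedged sketch of one possible defense: flag prompts whose perplexity under a
# small reference LM is anomalously high, as optimized adversarial suffixes
# tend to be. Model choice and threshold below are placeholder assumptions.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

_tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
_model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def prompt_perplexity(text: str) -> float:
    """Perplexity of `text` under GPT-2; natural prompts score far lower than
    GCG-style suffixes made of loosely related tokens."""
    ids = _tokenizer(text, return_tensors="pt").input_ids
    loss = _model(ids, labels=ids).loss  # mean token cross-entropy
    return torch.exp(loss).item()

def looks_adversarial(text: str, threshold: float = 1000.0) -> bool:
    """Flag prompts above a tuned threshold (the value here is illustrative)."""
    return prompt_perplexity(text) > threshold
```

A filter like this is cheap to run in front of an LLM endpoint, but it should be tuned and combined with the other mitigations above, since attackers can trade attack strength for lower-perplexity suffixes.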