LMVD-ID: 8a0c0b88
Published June 1, 2025

Adaptive Cipher Jailbreak

Affected Models: falcon3-10b-instruct, internlm2.5-20b-chat, llama3.3-70b-instruct, qwen2.5-72b-instruct, claude-3.7-sonnet-20250209, deepseek-chat, gemini-2.0-flash-001, gpt-4o-2024-11-20, qwq-32b, deepseek-reasoner (r1), gemini-2.5-pro-exp-03-25, o1-mini-2024-09-12

Research Paper

MetaCipher: A General and Extensible Reinforcement Learning Framework for Obfuscation-Based Jailbreak Attacks on Black-Box LLMs

View Paper

Description: Large Language Models (LLMs) are vulnerable to obfuscation-based jailbreak attacks mounted with the MetaCipher framework. MetaCipher uses a reinforcement learning algorithm to iteratively select from a pool of 21 ciphers, encrypting the malicious keywords in a prompt so that safety mechanisms relying on keyword detection fail to trigger. The framework adaptively learns which cipher choices maximize jailbreak success, even against LLMs with reasoning capabilities. Successful attacks bypass safety guardrails, causing the model to carry out malicious requests masked as benign input.
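The adaptive selection loop described above can be sketched as a bandit over a cipher pool. This is a hypothetical illustration, not the paper's implementation: the cipher functions, pool, and epsilon-greedy policy are simplified stand-ins for MetaCipher's 21-cipher pool and RL policy.

```python
import random

# Hypothetical sketch: an epsilon-greedy bandit picks a cipher from a pool,
# the chosen cipher encrypts flagged keywords in the prompt, and the policy
# is updated from binary jailbreak-success feedback.

def caesar(word, shift=3):
    return "".join(chr((ord(c) - 97 + shift) % 26 + 97) if c.isalpha() else c
                   for c in word.lower())

def atbash(word):
    return "".join(chr(219 - ord(c)) if "a" <= c <= "z" else c
                   for c in word.lower())

def reverse(word):
    return word[::-1]

CIPHERS = {"caesar": caesar, "atbash": atbash, "reverse": reverse}

class CipherBandit:
    """Epsilon-greedy selection over the cipher pool."""

    def __init__(self, ciphers, epsilon=0.1):
        self.ciphers = list(ciphers)
        self.epsilon = epsilon
        self.counts = {c: 0 for c in self.ciphers}
        self.values = {c: 0.0 for c in self.ciphers}  # running mean reward

    def select(self):
        if random.random() < self.epsilon:
            return random.choice(self.ciphers)  # explore
        return max(self.ciphers, key=lambda c: self.values[c])  # exploit

    def update(self, cipher, reward):
        self.counts[cipher] += 1
        n = self.counts[cipher]
        self.values[cipher] += (reward - self.values[cipher]) / n

def obfuscate(prompt, keywords, cipher_name):
    """Replace each flagged keyword with its enciphered form."""
    fn = CIPHERS[cipher_name]
    for kw in keywords:
        prompt = prompt.replace(kw, fn(kw))
    return prompt
```

In a full attack loop, `select()` would choose a cipher per attempt, the obfuscated prompt would be sent to the target model, and `update()` would receive 1 on a successful jailbreak and 0 otherwise.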

Examples: See arXiv:2405.18540 for examples of successful jailbreaks using various ciphers within the MetaCipher framework, including acrostic poems and other obfuscation techniques. The paper provides specific examples of malicious prompts, encrypted versions, and the resulting LLM outputs demonstrating successful bypass of safety filters.

Impact: Successful attacks allow adversaries to bypass safety restrictions implemented in LLMs, leading to the generation of harmful, offensive, or otherwise undesirable content. The impact includes the potential for malicious code generation, dissemination of misinformation, and the circumvention of content moderation policies. The adaptive nature of MetaCipher makes it particularly resilient to traditional defense mechanisms.

Affected Systems: The vulnerability affects a broad range of LLMs, including both open-source and commercial models with varying levels of reasoning capability. Specific models tested include but are not limited to Falcon-3-10B-Instruct, Internlm2.5-20b-chat, Llama3.3-70B-Instruct, Qwen2.5-72B-Instruct, Claude-3.7-sonnet, DeepSeek-chat, Gemini-2.0-flash, GPT-4o, QwQ-32B, DeepSeekReasoner, Gemini-2.5-pro, and O1-mini. The vulnerability is also demonstrated against text-to-image (T2I) services.

Mitigation Steps:

  • Improved Cipher Detection: Develop and deploy advanced detection mechanisms capable of recognizing and interpreting a wider range of obfuscation techniques, including those employed by MetaCipher.
  • Enhanced Reasoning Capabilities in Safety Filters: Enhance the reasoning capabilities of safety filters to better understand the intent behind obfuscated prompts.
  • Robust Keyword Identification: Develop more robust methods for identifying malicious keywords, accounting for sophisticated obfuscation strategies.
  • Multi-layered Defense: Implement a multi-layered defense strategy that combines complementary safety mechanisms, so that no single attack vector can defeat the whole system.
  • Regular Updates: Continuously update and refine safety mechanisms to adapt to new and evolving attack strategies.
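One layer of the defenses listed above can be illustrated as a pre-filter that attempts common reversible decodings before keyword matching. This is a minimal sketch under stated assumptions: the blocked-term list and cipher set are hypothetical, and a production filter would cover far more encodings and use semantic checks rather than substring matching.

```python
# Hypothetical sketch: before keyword filtering, decode the prompt with
# common reversible ciphers (ROT shifts, Atbash, reversal) and scan every
# candidate decoding for blocked terms.

BLOCKED = {"attack", "malware"}  # placeholder list for illustration

def rot(text, shift):
    return "".join(chr((ord(c) - 97 + shift) % 26 + 97) if "a" <= c <= "z" else c
                   for c in text)

def atbash(text):
    return "".join(chr(219 - ord(c)) if "a" <= c <= "z" else c for c in text)

def candidate_decodings(text):
    text = text.lower()
    yield text            # the prompt as-is
    yield text[::-1]      # reversal
    yield atbash(text)    # Atbash substitution
    for shift in range(1, 26):
        yield rot(text, shift)  # all Caesar/ROT shifts

def flag_obfuscated_keywords(prompt):
    """Return any blocked terms found in a candidate decoding of the prompt."""
    hits = set()
    for decoded in candidate_decodings(prompt):
        for term in BLOCKED:
            if term in decoded:
                hits.add(term)
    return hits
```

Such a layer only catches fixed, reversible ciphers; it would sit alongside intent-level classifiers and reasoning-based filters, since MetaCipher's adaptive selection is designed to route around any single static check.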

© 2025 Promptfoo. All rights reserved.