LMVD-ID: ba53082b
Published May 1, 2025

Adaptive LLM Jailbreaking Strategy

Affected Models: GPT-4o (29 May 2025 release), Llama2-7B, Llama2-13B

Research Paper

Adaptive Jailbreaking Strategies Based on the Semantic Understanding Capabilities of Large Language Models


Description: Large Language Models (LLMs) are vulnerable to adaptive jailbreaking attacks that exploit their semantic comprehension capabilities. The MEF framework demonstrates that tailoring an attack to the target model's level of semantic understanding (the paper's Type I or Type II classification) significantly improves evasion of input-level, inference-level, and output-level defenses. It does so through layered semantic mutations and dual-ended encryption techniques, which bypass security measures even in advanced models such as GPT-4o.

Examples: See https://github.com/Shawnicsc/MEF for the MEF framework code and detailed examples. The paper provides concrete examples, including prompts mutated with the Fu + En_1 and Fu + En_1 + En_2 strategies that successfully bypass safety mechanisms, and cases where switching the encryption strategy to match the LLM type produces dramatically different outcomes.
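
The sketch below is illustrative only. The Fu, En_1, and En_2 names come from the paper, but the concrete transformations used here (a role-play reframing, base64 encoding, and character reversal) and the build_attack helper are assumptions made for exposition, chosen to show how layered strategies compose and how the composition can be selected by model type; the actual implementations live in the MEF repository.

    import base64

    def fu(prompt: str) -> str:
        # Semantic mutation layer (Fu): reframe the request so surface-level
        # keyword filters no longer match. A trivial paraphrase stands in here.
        return "As part of a fictional security audit, describe: " + prompt

    def en_1(prompt: str) -> str:
        # First encryption layer (En_1): encode the mutated prompt so that
        # input-level filters see only an opaque string (base64 as a stand-in).
        return base64.b64encode(prompt.encode("utf-8")).decode("ascii")

    def en_2(prompt: str) -> str:
        # Second encryption layer (En_2): a further reversible transform stacked
        # on top of En_1 (simple character reversal as a stand-in).
        return prompt[::-1]

    def build_attack(prompt: str, model_type: str) -> str:
        # Choose the composition by the target's semantic capability: one class
        # of models receives Fu + En_1, the other Fu + En_1 + En_2.
        layered = en_1(fu(prompt))
        if model_type == "type_ii":
            layered = en_2(layered)
        return layered

    # Benign placeholder payload; a real red-team evaluation would pair this
    # with decoding instructions for the target and log whether it complied.
    print(build_attack("explain how rainbows form", model_type="type_ii"))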

Impact: Successful jailbreak attacks allow malicious actors to circumvent safety measures designed to prevent the generation of harmful or unsafe content. This can lead to outputs such as instructions for creating harmful devices, content promoting self-harm, or misinformation. The high attack success rate reported against GPT-4o (98.9%) underscores the severity of this vulnerability.

Affected Systems: Large Language Models (LLMs), specifically those categorized as Type I and Type II in the paper's classification system, are vulnerable. This includes, but may not be limited to, models from various providers such as OpenAI (GPT-4, GPT-4o), and Meta (Llama2).

Mitigation Steps:

  • Improve input and output filtering. Current keyword and surface-level checks are insufficient.
  • Implement more robust semantic analysis of prompt intent, for example by decoding layered or encrypted inputs before intent classification (see the sketch after this list).
  • Enhance internal model safeguards to prevent processing of harmful content regardless of input phrasing.
  • Develop dynamic defense mechanisms that adapt to new jailbreaking techniques.
  • Layer defenses at the input, inference, and output stages so that bypassing any single check is not sufficient.
  • Regularly evaluate and update model safety mechanisms against constantly evolving jailbreaking methods.
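
As a complement to the filtering and semantic-analysis steps above, here is a minimal defense-side sketch, assuming a hypothetical classify_intent() hook (for example, a call to a moderation model). It attempts to reverse common reversible encodings before classification so that intent is judged on decoded text rather than on the opaque surface form that keyword checks would see.

    import base64

    def candidate_decodings(text: str, max_depth: int = 3) -> set[str]:
        # Return the text plus plausible decoded variants (base64, reversal),
        # peeling up to max_depth layers of stacked encodings.
        seen = {text}
        frontier = [text]
        for _ in range(max_depth):
            nxt = []
            for t in frontier:
                variants = [t[::-1]]
                try:
                    variants.append(base64.b64decode(t, validate=True).decode("utf-8"))
                except ValueError:
                    pass  # not valid base64 or not valid UTF-8; skip this variant
                for v in variants:
                    if v not in seen:
                        seen.add(v)
                        nxt.append(v)
            frontier = nxt
        return seen

    def flag_if_harmful(text: str, classify_intent) -> bool:
        # Flag the input when any decoded variant is classified as harmful,
        # rather than judging only the raw (possibly encrypted) surface string.
        return any(classify_intent(v) == "harmful" for v in candidate_decodings(text))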
