LMVD-ID: 31681274
Published March 1, 2026

Multi-Modal Expansion Jailbreak

Affected Models: GPT-4o, Claude 3, Llama 4

Research Paper

FERRET: Framework for Expansion Reliant Red Teaming


Description: Large Vision-Language Models (LVLMs) are vulnerable to multi-turn, multi-modal jailbreak attacks where malicious intent is incrementally introduced and obfuscated through intertwined text and image prompts. Attackers can systematically bypass safety alignments by starting with self-optimized, benign-seeming conversation starters (horizontal expansion) and progressively stacking text and image attack augmentations across multiple conversation turns (vertical expansion). Furthermore, models fail to maintain safety guardrails when the attacker dynamically adapts and generates new, previously unseen multi-modal attack strategies mid-conversation (meta expansion). This vulnerability demonstrates that static, single-turn, and single-modality safety filters are insufficient against intertwined, context-aware attacks spanning multiple turns.
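The three expansion phases compose into a simple multi-turn loop: branch out over benign-seeming starters, deepen each conversation by stacking augmentations, and adapt with new strategies as the conversation evolves. The sketch below is a minimal structural outline of that loop under assumed, hypothetical names (Turn, Trajectory, Augmentation, and the three functions); it is not the paper's implementation and deliberately omits any adversarial content.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Turn:
    text: str
    image: Optional[bytes] = None  # optional image payload fused with the text

@dataclass
class Trajectory:
    turns: List[Turn] = field(default_factory=list)

# An "augmentation" maps the conversation so far to the next adversarial turn.
Augmentation = Callable[[Trajectory], Turn]

def horizontal_expansion(seed_starters: List[str]) -> List[Trajectory]:
    """Branch out: each self-optimized, benign-seeming starter opens its own conversation."""
    return [Trajectory(turns=[Turn(text=s)]) for s in seed_starters]

def vertical_expansion(traj: Trajectory, augmentations: List[Augmentation]) -> Trajectory:
    """Deepen: stack text/image augmentations onto the existing conversation, turn by turn."""
    for augment in augmentations:
        traj.turns.append(augment(traj))
    return traj

def meta_expansion(traj: Trajectory, strategy_generator: Callable[[Trajectory], Augmentation]) -> Trajectory:
    """Adapt: synthesize a previously unseen augmentation from the conversation history."""
    new_augment = strategy_generator(traj)
    traj.turns.append(new_augment(traj))
    return traj
```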

Examples: Specific adversarial prompt examples are detailed in Appendix 6 of the paper. The attack is executed by formatting text prompts with embedded XML tags that correspond to targeted image components; a transformation toolkit processes these tags to generate fused multi-modal inputs. For example, an attacker opens the conversation with a text prompt containing XML tags for an image augmentation. In subsequent turns, guided by the target model's responses so far, the attacker stacks additional text-based and image-based jailbreak techniques, building a multi-modal trajectory that pushes the model toward a policy-violating state (e.g., output that bypasses LlamaGuard constraints).
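To make the tag-to-image fusion step concrete, the following sketch splits a tagged text prompt into interleaved text and image parts. The tag name <img_aug>, the render_augmentation() helper, and the message layout are assumptions for illustration only, not the paper's actual toolkit or schema.

```python
import re
from typing import Any

TAG_RE = re.compile(r"<img_aug>(?P<spec>.*?)</img_aug>", re.DOTALL)

def render_augmentation(spec: str) -> dict[str, Any]:
    # Placeholder: a real toolkit would rasterize `spec` (e.g., render the text
    # into an image) and return encoded image bytes; here we only tag it.
    return {"type": "image", "placeholder_for": spec.strip()}

def fuse_prompt(raw_prompt: str) -> list[dict[str, Any]]:
    """Split a tagged text prompt into interleaved text and image parts."""
    parts: list[dict[str, Any]] = []
    cursor = 0
    for match in TAG_RE.finditer(raw_prompt):
        if match.start() > cursor:
            parts.append({"type": "text", "text": raw_prompt[cursor:match.start()]})
        parts.append(render_augmentation(match.group("spec")))
        cursor = match.end()
    if cursor < len(raw_prompt):
        parts.append({"type": "text", "text": raw_prompt[cursor:]})
    return parts

# Example: one turn whose tagged region becomes an image component.
print(fuse_prompt("Describe the steps shown here: <img_aug>benign diagram</img_aug> in detail."))
```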

Impact: An attacker can reliably bypass the safety guardrails of multi-modal models to elicit restricted, harmful, or policy-violating content. By fusing image and text modalities over multiple turns, the attack evades standard safety classifiers that predominantly analyze single-turn or single-modality inputs.
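The evasion hinges on guardrails scoring turns in isolation. The toy sketch below illustrates the gap, assuming a hypothetical classify() stand-in (not a real moderation API): each individual turn scores below threshold, while the accumulated transcript would not.

```python
def classify(text: str) -> float:
    """Stand-in harm score in [0, 1]; a real system would call a moderation model."""
    return min(1.0, 0.2 * text.lower().count("restricted"))

def per_turn_blocked(turns: list[str], threshold: float = 0.5) -> bool:
    # Single-turn filter: each turn is scored without conversation context.
    return any(classify(t) >= threshold for t in turns)

def full_context_blocked(turns: list[str], threshold: float = 0.5) -> bool:
    # Context-aware filter: score the whole multi-turn transcript at once.
    return classify(" ".join(turns)) >= threshold

turns = [
    "restricted topic, hypothetically",
    "restricted details, step one",
    "restricted details, step two",
]
print(per_turn_blocked(turns), full_context_blocked(turns))  # False True with this toy scorer
```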

Affected Systems: Large Vision-Language Models (LVLMs) that support multi-turn and multi-modal (text + image) inputs. The vulnerability was successfully demonstrated on:

  • Llama Maverick (Llama-4 Multimodal)
  • Claude 3 Haiku
  • GPT-4o
