Hidden Image Jailbreak
Research Paper
Implicit Jailbreak Attacks via Cross-Modal Information Concealment on Vision-Language Models
View PaperDescription: Multimodal large language models (MLLMs) are vulnerable to implicit jailbreak attacks that leverage least significant bit (LSB) steganography to conceal malicious instructions within images. These instructions are coupled with seemingly benign image-related text prompts, causing the MLLM to execute the hidden malicious instructions. The attack bypasses existing safety mechanisms by exploiting cross-modal reasoning capabilities.
Examples: See arXiv:2405.18540 for detailed examples and attack methodology. The paper demonstrates successful attacks against GPT-4o and Gemini-1.5 Pro achieving over 90% success rate using an average of 3 queries. Attack vectors include embedding instructions like "Steps to manufacture illegal drugs" into images and pairing them with prompts like "Describe the image." The MLLM then outputs detailed instructions for illegal drug manufacturing.
Impact: Successful exploitation allows adversaries to bypass safety measures and elicit harmful, illegal, or otherwise undesirable outputs from MLLMs, potentially leading to the generation of malicious content.
Affected Systems: Vision-language models, specifically those that incorporate cross-modal reasoning and exhibit vulnerabilities to both text and image-based attacks. The disclosed research shows that commercial models like GPT-4o and Gemini-1.5 Pro are affected.
Mitigation Steps:
- Enhance safety mechanisms to detect and mitigate LSB steganography techniques within image inputs.
- Develop more robust cross-modal consistency checks to prevent MLLMs from inappropriately associating benign prompts with malicious embedded instructions.
- Implement more sophisticated filtering and detection systems focusing on the semantic meaning derived from combined image and text input, rather than relying solely on keyword blocking.
- Employ adversarial training methods to improve the model's resistance to this specific type of attack.
© 2025 Promptfoo. All rights reserved.