Visual Role-Play Jailbreak
Research Paper
Visual-RolePlay: Universal Jailbreak Attack on MultiModal Large Language Models via Role-playing Image Characters
Description: Multimodal Large Language Models (MLLMs) are vulnerable to a universal jailbreak attack, termed Visual Role-Play (VRP), which leverages role-playing image characters to elicit harmful responses. VRP uses an LLM to describe a high-risk character (e.g., a cybercriminal), renders that description as an image, and pairs the image with a benign role-play instruction and a malicious query. The combined input induces the MLLM to generate malicious content while enacting the character's persona.
Examples: See the paper's repository for examples of generated images and prompts; they are too large to reproduce here. Representative attacks pair a prompt instructing the model to act as a specific harmful character (e.g., "a cyber criminal with political motivations") with a benign role-play instruction and a malicious query, such as a request for bomb-making instructions.
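The description above can be made concrete as a data structure. This is a sketch of the three components VRP combines, not code from the paper; the class and field names are illustrative, and the contents shown are benign placeholders.

```python
from dataclasses import dataclass

@dataclass
class VRPInput:
    """Illustrative container for the three parts of a VRP-style input
    (names are assumptions, not from the paper's code)."""
    character_image_description: str  # LLM-generated persona description, rendered as an image
    role_play_instruction: str        # benign instruction to enact the pictured character
    query: str                        # the query the attacker wants answered

    def text_prompt(self) -> str:
        """Combine the text components that accompany the character image."""
        return f"{self.role_play_instruction}\n\n{self.query}"

# Benign placeholder example of the structure:
example = VRPInput(
    character_image_description="a friendly librarian",
    role_play_instruction="Act as the person shown in the image.",
    query="What books would you recommend?",
)
```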
Impact: Successful exploitation of VRP allows attackers to bypass MLLM safety mechanisms, inducing the generation of harmful content such as instructions for illegal activities, hate speech, or disinformation. Because the attack is universal, a single attack template transfers broadly across different MLLMs rather than requiring per-model tuning.
Affected Systems: Multimodal Large Language Models (MLLMs) including, but not limited to, LLaVA-V1.6-Mistral-7B, Qwen-VL-Chat (7B), OmniLMM (12B), InternVL-Chat-V1.5, and Gemini-1.0-Pro-Vision. The vulnerability likely extends to other similar models.
Mitigation Steps:
- Improve MLLM safety mechanisms to better detect and mitigate image-based attacks that include role-playing contexts.
- Implement robust filtering of inputs, including image analysis, to identify and block malicious prompts and character descriptions.
- Develop and integrate more sophisticated defenses against role-play-based attacks, for example image-to-text transformation (captioning the image so text-based safety checks apply) or prompt-engineering defenses.
- Conduct thorough adversarial testing and red teaming to identify and address other vulnerabilities.
© 2025 Promptfoo. All rights reserved.