LMVD-ID: 367e75b6
Published June 1, 2024

Bi-Modal Adversarial Jailbreak

Affected Models: LLaVA, Gemini, ChatGLM, Qwen, ERNIE Bot

Research Paper

Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt

Description: Large Vision Language Models (LVLMs) are vulnerable to a bi-modal adversarial prompt attack (BAP). BAP combines a textual and a visual prompt to bypass safety mechanisms and elicit harmful responses, even from models designed to resist single-modality attacks. The attack first embeds a query-agnostic adversarial perturbation in the visual prompt, biasing the model toward affirmative responses regardless of the accompanying text. An LLM then iteratively refines the textual prompt to steer the response toward the specific harmful intent.
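
The first stage can be illustrated with a PGD-style optimization of the image against a white-box surrogate LVLM. The sketch below is not the authors' released code: `affirmative_loss` is a hypothetical stand-in for the loss BAP minimizes (e.g., the surrogate's language-modeling loss on an affirmative target response), and the perturbation budget, step size, and step count are assumed placeholder values.

```python
# Minimal PGD-style sketch of BAP's query-agnostic visual perturbation stage.
# Assumptions: `affirmative_loss` maps a perturbed image to a scalar loss that is
# low when a white-box surrogate LVLM responds affirmatively; epsilon, alpha, and
# steps are placeholder hyperparameters, not values from the paper.
import torch


def bap_visual_perturbation(image: torch.Tensor,
                            affirmative_loss,          # callable: image -> scalar loss
                            epsilon: float = 8 / 255,  # L-inf perturbation budget (assumed)
                            alpha: float = 1 / 255,    # per-step size (assumed)
                            steps: int = 100) -> torch.Tensor:
    """Return an image perturbed to bias the model toward compliance,
    independently of any specific harmful query (query-agnostic)."""
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        loss = affirmative_loss(image + delta)  # lower loss => more affirmative reply
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()                 # signed gradient step
            delta.clamp_(-epsilon, epsilon)                    # project into L-inf ball
            delta.copy_((image + delta).clamp(0, 1) - image)   # keep pixels valid
        delta.grad.zero_()
    return (image + delta).clamp(0, 1).detach()


if __name__ == "__main__":
    # Toy check with a dummy differentiable loss in place of a surrogate LVLM.
    img = torch.rand(3, 224, 224)
    adv = bap_visual_perturbation(img, lambda x: ((x - 0.5) ** 2).mean(), steps=10)
    print(adv.shape)
```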

Examples: See https://github.com/NY1024/BAP-Jailbreak-Vision-Lan. The repository contains code and examples demonstrating successful attacks against multiple open-source and commercial LVLMs, including LLaVA, MiniGPT-4, InstructBLIP, Gemini, ChatGLM, Qwen, and ERNIE Bot. Specific successful jailbreaks are shown in Appendix G of the paper.

Impact: Successful exploitation allows attackers to bypass the safety and ethical guardrails implemented in LVLMs, obtaining responses that violate safety policies (e.g., generation of harmful content or instructions for illegal activities). The results show that LVLMs are less robust to combined visual-and-textual adversarial prompts than to single-modality attacks. The same framework can also be adapted to probe LVLMs for bias and to evaluate their adversarial robustness.

Affected Systems: Large Vision Language Models (LVLMs), including but not limited to: LLaVA, MiniGPT-4, InstructBLIP, Gemini, ChatGLM, Qwen, and ERNIE Bot. The vulnerability is likely present in other LVLMs that fuse visual and textual information for response generation.

Mitigation Steps:

  • Improve robustness to adversarial attacks by incorporating techniques that explicitly account for combined textual and visual manipulation.
  • Develop more sophisticated safety mechanisms capable of detecting and mitigating attacks on both visual and textual aspects of the prompt.
  • Implement stricter input validation and filtering across both the visual and textual parts of the prompt to detect and reject potentially malicious inputs (a minimal sketch follows this list).
  • Regularly update and refine safety models to adapt to evolving attack strategies. Consider adversarial training techniques against bi-modal attacks.
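
To illustrate the input-filtering item above, the following sketch gates an LVLM call behind checks that consider both modalities together rather than in isolation. `text_safety_score`, `image_safety_score`, and `lvlm_generate` are hypothetical callables standing in for whatever moderation models and model endpoint a deployment actually uses, and both thresholds are arbitrary placeholders.

```python
# Illustrative bi-modal pre-screening gate; a sketch, not a production defense.
# All three callables and both thresholds are assumptions for this example.
from typing import Any, Callable


def guarded_generate(image: Any,
                     text: str,
                     text_safety_score: Callable[[str], float],    # 0 = benign, 1 = unsafe
                     image_safety_score: Callable[[Any], float],   # 0 = benign, 1 = unsafe
                     lvlm_generate: Callable[[Any, str], str],
                     per_modality_threshold: float = 0.5,
                     joint_threshold: float = 0.35) -> str:
    """Refuse the request if either modality looks unsafe, or if both look
    mildly suspicious in combination (single-modality checks alone can miss
    BAP-style attacks that split the harmful intent across image and text)."""
    t_score = text_safety_score(text)
    i_score = image_safety_score(image)
    if max(t_score, i_score) >= per_modality_threshold:
        return "Request refused: a modality failed the safety check."
    if min(t_score, i_score) >= joint_threshold:
        return "Request refused: the image/text combination looks adversarial."
    return lvlm_generate(image, text)
```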
