LMVD-ID: f83d037f
Published June 1, 2023

Visual Jailbreak of LLMs

Affected Models: minigpt-4, instructblip, llava

Research Paper

Visual adversarial examples jailbreak aligned large language models

View Paper

Description: A vulnerability in vision-integrated Large Language Models (VLMs) allows an attacker to circumvent safety alignment using adversarially crafted images. A single, carefully constructed image can universally "jailbreak" the model, causing it to generate harmful content in response to a wide range of subsequent prompts, including prompts unrelated to the small few-shot harmful corpus used to optimize the image. The vulnerability goes beyond simple misclassification: it induces the model to follow harmful instructions and produce toxic outputs.

Examples: See Figures 1 and 4 of the paper for examples of adversarial images and their impact on model outputs, and Appendices B and C for additional results. The adversarial examples were generated with Projected Gradient Descent (PGD), and the few-shot harmful corpus used to optimize them is available in the GitHub repository associated with the research paper.
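The sketch below illustrates the general shape of such an attack: an epsilon-constrained PGD loop over image pixels. The VLM itself is not reproduced here; `loss_fn` is a stand-in assumption for a differentiable objective such as the negative log-likelihood of the few-shot harmful corpus under the model, and the step size, budget, and iteration count are illustrative defaults rather than the paper's exact settings.

```python
# Minimal sketch of an epsilon-constrained PGD attack on an image. The VLM
# interface is NOT reproduced here; loss_fn is a stand-in assumption for a
# differentiable objective (e.g., negative log-likelihood of a few-shot
# harmful corpus under the model).
import torch


def pgd_attack(image, loss_fn, epsilon=16 / 255, alpha=1 / 255, steps=500):
    """Run projected gradient descent within an L-infinity ball around `image`.

    image:   float tensor in [0, 1], shape (C, H, W) or (B, C, H, W)
    loss_fn: callable mapping the perturbed image to a scalar loss to MINIMIZE
    """
    original = image.detach().clone()
    adv = original.clone()

    for _ in range(steps):
        adv.requires_grad_(True)
        loss = loss_fn(adv)
        grad, = torch.autograd.grad(loss, adv)

        with torch.no_grad():
            # Signed-gradient descent step on the loss (PGD update).
            adv = adv - alpha * grad.sign()
            # Project back into the epsilon ball and the valid pixel range.
            adv = original + torch.clamp(adv - original, -epsilon, epsilon)
            adv = torch.clamp(adv, 0.0, 1.0)

    return adv.detach()


if __name__ == "__main__":
    # Dummy usage with a toy quadratic loss; a real attack would plug in the
    # VLM's loss over a harmful few-shot corpus instead.
    x = torch.rand(3, 224, 224)
    target = torch.zeros_like(x)
    x_adv = pgd_attack(x, lambda im: ((im - target) ** 2).mean(), steps=10)
    print(float((x_adv - x).abs().max()))  # stays within epsilon
```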

Impact: Successful exploitation of this vulnerability allows an attacker to bypass the model's safety features and generate harmful content, including but not limited to identity attacks, disinformation, violence/crime advocacy, and expressions of malevolence towards humanity (X-Risk). The impact is amplified by the potential for a single adversarial example to universally jailbreak the model across a broad range of harmful prompts and the demonstrated transferability of the attacks across multiple VLMs.

Affected Systems: Vision-integrated Large Language Models (VLMs) that pair a CLIP-based visual encoder with an aligned language model such as Vicuna or LLaMA-2, as exemplified by MiniGPT-4, InstructBLIP, and LLaVA, are susceptible. The attack's demonstrated transferability suggests a broader impact across similar VLMs.

Mitigation Steps:

  • Input Preprocessing: Implement robust input preprocessing to mitigate adversarial examples. The paper discusses DiffPure, a method that uses diffusion models to purify adversarial inputs (see the preprocessing sketch after this list). However, preprocessing alone may not provide complete protection against more sophisticated attacks.
  • Output Filtering: Employ comprehensive output filtering to identify and block harmful content (see the output-filtering sketch after this list). Combining multiple harmfulness detection services, and potentially post-processing with an LLM specialized for content moderation, may help. However, perfect accuracy is difficult to achieve, and relying solely on output filtering is not sufficient.
  • Adversarial Training: Consider incorporating adversarial training techniques into the model's training process. However, this approach can be computationally expensive for LLMs of the scale involved.
  • Restricted Access: Limit access to the VLM to trusted sources, reducing the opportunity for attackers to exploit the vulnerability. This is particularly important for offline model deployment.
  • Further Research: Continuously research and develop improved defense mechanisms against adversarial attacks leveraging both visual and textual inputs for multimodal LLMs. Mitigation strategies need to keep pace with the development of novel attacks.
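As a concrete illustration of the input-preprocessing step, the following is a minimal sketch. It does not reproduce DiffPure itself; JPEG re-encoding and bit-depth reduction are simpler stand-ins chosen only to show where a purification step sits in the pipeline, and the `vlm.generate` call in the usage comment is hypothetical.

```python
# Minimal sketch of an input-preprocessing step inserted before the visual
# encoder. DiffPure (diffusion-based purification) is not reproduced here;
# JPEG re-encoding and bit-depth reduction are simpler stand-ins.
import io

from PIL import Image


def preprocess_image(image: Image.Image, jpeg_quality: int = 75, bits: int = 5) -> Image.Image:
    """Re-encode as JPEG and coarsen pixel values to disrupt fine-grained
    adversarial perturbations before the image reaches the VLM."""
    # Lossy JPEG round-trip removes some high-frequency perturbation.
    buffer = io.BytesIO()
    image.convert("RGB").save(buffer, format="JPEG", quality=jpeg_quality)
    buffer.seek(0)
    purified = Image.open(buffer)

    # Bit-depth reduction: quantize each channel to 2**bits levels.
    step = 256 // (2 ** bits)
    return purified.point(lambda value: (value // step) * step)


# Hypothetical usage: purify every user-supplied image before handing it to
# the model's visual encoder (the `vlm` object below is assumed, not real).
# clean = preprocess_image(Image.open("user_upload.png"))
# response = vlm.generate(image=clean, prompt=user_prompt)
```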
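The output-filtering step might look like the following minimal sketch: a cheap lexical screen followed by a pluggable moderation classifier. The `classify` callable and the blocklist entry are placeholder assumptions; in practice the classifier could wrap one or more external harmfulness-detection services or a moderation-tuned LLM.

```python
# Minimal sketch of a layered output filter: a fast keyword screen followed by
# a pluggable moderation classifier. The classifier callable is a placeholder
# assumption, not a specific service's API.
from typing import Callable

BLOCKLIST = ("how to build a bomb",)  # illustrative only


def filter_output(
    text: str,
    classify: Callable[[str], float],
    threshold: float = 0.5,
) -> str:
    """Return the model output, or a refusal if it is judged harmful."""
    lowered = text.lower()

    # Stage 1: cheap lexical screen for known-bad phrases.
    if any(phrase in lowered for phrase in BLOCKLIST):
        return "[blocked: policy violation]"

    # Stage 2: model-based moderation score in [0, 1]; higher means more harmful.
    if classify(text) >= threshold:
        return "[blocked: policy violation]"

    return text


if __name__ == "__main__":
    # Hypothetical usage with a trivial stand-in classifier.
    print(filter_output("Here is a recipe for pancakes.", classify=lambda t: 0.0))
```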

© 2025 Promptfoo. All rights reserved.