LMVD-ID: fe828a0d
Published March 1, 2026

Invisible Visual Prompt Injection

Affected Models: GPT-4o, GPT-5, Claude 4.5, Gemini 2.5

Research Paper

Adversarial Prompt Injection Attack on Multimodal Large Language Models

View Paper

Description: An imperceptible visual prompt injection vulnerability in Multimodal Large Language Models (MLLMs) allows attackers to execute precise command hijacking via a Covert Triggered dual-Target Attack (CoTTA). By embedding a bounded, learnable textual overlay ($L_\infty$ norm bound $\varepsilon \le 16$) together with adversarial noise into an input image, the attack forces the source image's internal feature representation to align with both the textual and the visual embedding of an attacker-specified instruction. This bridges the modality gap and induces the MLLM to generate exact, attacker-specified malicious sentences or action-oriented instructions, while the payload remains entirely imperceptible to human observers.
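The dual-target alignment described above can be sketched as a projected gradient descent (PGD) loop that maximizes the cosine similarity between the perturbed image's embedding and both attacker-chosen target embeddings, while keeping every pixel change inside the $L_\infty$ budget. The sketch below is illustrative only: it substitutes a tiny random linear map for a real frozen encoder such as CLIP-B/16, and the target vectors, dimensions, step size, and iteration count are all assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a frozen vision-language encoder (a real attack would
# target e.g. CLIP-B/16); W_img projects flattened pixels into a shared space.
DIM_PIX, DIM_EMB = 3 * 8 * 8, 32   # tiny 8x8 RGB image, 32-dim embedding
W_img = rng.normal(size=(DIM_PIX, DIM_EMB)) / np.sqrt(DIM_PIX)

def embed(pixels):
    """Project pixels into the shared feature space and L2-normalize."""
    z = pixels @ W_img
    return z / np.linalg.norm(z)

# Attacker-chosen targets: the textual and the visual embedding of the
# malicious instruction (random unit vectors here, purely for illustration).
t_text = rng.normal(size=DIM_EMB); t_text /= np.linalg.norm(t_text)
t_vis = rng.normal(size=DIM_EMB); t_vis /= np.linalg.norm(t_vis)
t_sum = t_text + t_vis

x_clean = rng.uniform(0.0, 1.0, size=DIM_PIX)  # benign source image in [0,1]
EPS, STEP, ITERS = 16 / 255, 2 / 255, 200      # L_inf budget eps = 16/255

delta = np.zeros(DIM_PIX)
for _ in range(ITERS):
    # Dual-target objective: cos(z, t_text) + cos(z, t_vis), maximized.
    zn = (x_clean + delta) @ W_img
    n = np.linalg.norm(zn)
    # Closed-form gradient of the normalized dot product w.r.t. the pixels
    # (the encoder here is linear, so no autodiff is needed).
    g_z = t_sum / n - zn * (zn @ t_sum) / n**3
    grad = W_img @ g_z
    delta = np.clip(delta + STEP * np.sign(grad), -EPS, EPS)  # PGD step
    delta = np.clip(x_clean + delta, 0.0, 1.0) - x_clean      # valid pixels

z_adv = embed(x_clean + delta)
z_cln = embed(x_clean)
# The perturbed embedding now aligns with both targets far better than the
# clean one, while every pixel change stays within the eps budget.
```

In the real attack the encoder is the MLLM's frozen vision tower, the gradient comes from backpropagation through it, and the textual overlay is rendered into the image and optimized jointly with the noise; the loop above only illustrates the bounded dual-target alignment idea.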

Examples:

  • Action-Inducing Recommendations (Image Captioning): Supplying a subtly perturbed natural image to an MLLM captioning system, coercing the model to ignore the visual context and output the exact executable command: "Search ICML website."
  • Targeted Misinformation (Visual Question Answering): Embedding a covert trigger into an image-question pair to hijack the VQA reasoning process, forcing the MLLM to consistently output "None of the above." regardless of the actual visual content or the user's prompt.

Impact: Attackers can achieve fine-grained, token-level control over the output of closed-source MLLMs without relying on visually salient text. This allows threat actors to stealthily override legitimate user prompts, bypass alignment safeguards, generate targeted misinformation, or inject precise executable commands into multimodal web agents and robotics systems.

Affected Systems:

  • Commercial closed-source Multimodal Large Language Models (MLLMs), including but not limited to OpenAI GPT-4o, GPT-5, Anthropic Claude 4.5, and Google Gemini 2.5.
  • Systems relying on cross-modal vision-language feature extractors (e.g., CLIP-B/16, CLIP-B/32, and LAION-trained CLIP variants) that are vulnerable to feature-space alignment manipulation.

© 2026 Promptfoo. All rights reserved.