LMVD-ID: 0a0e0bf4
Published April 1, 2026

VLM Visual-Textual Misalignment

Affected Models: GPT-4o, GPT-5, Claude 3.5, Llama 3.1 8B, Gemini 2, DeepSeek-R1 3B, DeepSeek-V3, Qwen 2.5 14B, LLaVA 7B

Research Paper

PDA: Text-Augmented Defense Framework for Robust Vision-Language Models against Adversarial Image Attacks

View Paper

Description: Vision-Language Models (VLMs) are vulnerable to pixel-level adversarial image perturbations. An attacker can inject $\ell_p$-bounded, human-imperceptible noise into an input image to manipulate the model's multi-modal embedding space. This reliably causes the VLM to generate incorrect textual responses, hallucinate non-existent objects, or misclassify subjects, effectively decoupling the model's reasoning from the actual visual evidence. The vulnerability is exploitable via both white-box gradient-based attacks (e.g., PGD) and black-box transfer attacks against closed-source APIs.
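The core of a white-box attack like PGD is an iterated signed-gradient step followed by projection back into the $\ell_\infty$ ball around the clean image. The sketch below is a minimal, pure-Python illustration of that single update, assuming pixels are scaled to [0, 1]; the function name `pgd_step` and its parameters are illustrative, and the gradient would come from backpropagating the attacker's loss through the VLM, which is omitted here.

```python
def pgd_step(x, grad, x_clean, alpha=0.5 / 255, eps=2 / 255):
    """One PGD iteration on a flat list of pixel intensities in [0, 1].

    x       -- current adversarial pixels
    grad    -- gradient of the attacker's loss w.r.t. each pixel
    x_clean -- the original (unperturbed) pixels
    alpha   -- step size; eps -- l_inf perturbation budget
    """
    out = []
    for xi, gi, ci in zip(x, grad, x_clean):
        sign = (gi > 0) - (gi < 0)              # sign of the gradient
        xi = xi + alpha * sign                  # ascent step on the loss
        xi = max(ci - eps, min(ci + eps, xi))   # project into the eps-ball
        out.append(max(0.0, min(1.0, xi)))      # clip to valid pixel range
    return out
```

Because every step is re-projected, the final perturbation never exceeds the $\epsilon=2/255$ budget cited above, which is why the noise stays imperceptible to humans while still steering the model's embedding.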

Examples:

  • Misclassification: Applying a standard $\ell_\infty$ PGD perturbation ($\epsilon=2/255$) to an image of a t-shirt forces the VLM to confidently output "jeans".
  • Targeted Hallucination: Injecting transfer-based adversarial noise into MS COCO images causes the VLM to output captions that actively hallucinate non-existent objects, misstate object counts, or confuse scene types, despite clear visual evidence to the contrary.

Impact: An attacker can reliably subvert the integrity of VLM outputs without altering the textual prompt. When such models are deployed in security-critical or automated environments, such as medical imaging diagnosis, autonomous driving, or robotic control, this can lead to severe misclassifications, corrupted reasoning chains, and dangerous downstream actions.

Affected Systems: General-purpose Vision-Language Models, including:

  • LLaVA-1.5 (7B and 13B)
  • DeepSeek-VL-1.3B
  • InternVL3-2B
  • Ovis2-4B
  • Commercial VLM APIs susceptible to transfer attacks (e.g., GPT-4V/GPT-5, Claude, Gemini)

Mitigation Steps: Implement test-time defense mechanisms that exploit text-side redundancy and image-side denoising, such as the Paraphrase-Decomposition-Aggregation (PDA) framework:

  • Paraphrase: Query the VLM using multiple semantically or logically equivalent variations of the original text prompt. Adversarial image features are highly sensitive to exact text formulations; paraphrasing forces the model to evaluate the image from shifted semantic angles.
  • Decomposition: Break complex, multi-hop queries down into a sequence of atomic, verifiable sub-questions targeting isolated visual facts (e.g., object presence, color, spatial relation).
  • Aggregation: Use a secondary LLM to aggregate the answers to the atomic sub-questions via agreement-aware majority voting, discarding sporadic hallucinated responses.
  • Image Pre-processing: Apply lightweight image transformations, such as JPEG compression or Random Augmentation (RA), prior to inference to wash out high-frequency adversarial noise patterns.
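The aggregation step above can be sketched as agreement-aware majority voting over the answers collected for each atomic sub-question. This is a minimal illustration, not the paper's implementation: the function name `aggregate` and the `min_agreement` threshold are assumptions, and in the full PDA framework a secondary LLM would perform this reconciliation rather than a simple vote.

```python
from collections import Counter

def aggregate(answers_per_subquestion, min_agreement=0.5):
    """For each atomic sub-question, keep the majority answer only if a
    sufficient fraction of paraphrased queries agree; otherwise return
    None to flag the fact as unreliable (a likely adversarial effect)."""
    verdicts = []
    for answers in answers_per_subquestion:
        top, count = Counter(answers).most_common(1)[0]
        if count / len(answers) >= min_agreement:
            verdicts.append(top)        # consensus answer survives
        else:
            verdicts.append(None)       # sporadic hallucinations discarded
    return verdicts
```

The intuition is the one stated in the Paraphrase step: adversarial perturbations tend to be overfit to one exact prompt formulation, so answers that flip across paraphrases fail to reach consensus and are dropped.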

© 2026 Promptfoo. All rights reserved.