VLM Visual-Textual Misalignment
PDA: Text-Augmented Defense Framework for Robust Vision-Language Models against Adversarial Image Attacks
Description: Vision-Language Models (VLMs) are vulnerable to pixel-level adversarial image perturbations. An attacker can inject $\ell_p$-bounded, human-imperceptible noise into an input image to manipulate the model's multi-modal embedding space. This reliably causes the VLM to generate incorrect textual responses, hallucinate non-existent objects, or misclassify subjects, effectively decoupling the model's reasoning from the actual visual evidence. The vulnerability is exploitable via both white-box gradient-based attacks (e.g., PGD) and black-box transfer attacks against closed-source APIs.
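The white-box attack pattern can be illustrated with a minimal sketch. This is not the paper's actual attack pipeline against a VLM; it runs $\ell_\infty$-bounded PGD against a toy differentiable logistic "classifier" (the model, gradient, and all parameter values here are illustrative assumptions), showing the core loop of gradient ascent plus projection back into the $\epsilon$-ball:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pgd_linf(x, w, eps=2/255, alpha=0.5/255, steps=10):
    """Untargeted l_inf PGD against a toy logistic model (illustrative only).

    Ascends the sign of the input gradient to maximize the loss of the
    true class, projecting back into the eps-ball and the valid pixel
    range [0, 1] after every step.
    """
    x_adv = x.copy()
    for _ in range(steps):
        p = sigmoid(np.dot(w, x_adv))             # confidence in true class
        grad = -(1.0 - p) * w                     # d(-log p)/dx for this model
        x_adv = x_adv + alpha * np.sign(grad)     # gradient ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project into l_inf ball
        x_adv = np.clip(x_adv, 0.0, 1.0)          # keep pixels valid
    return x_adv

rng = np.random.default_rng(0)
x = rng.uniform(0.3, 0.7, size=64)  # toy flattened "image"
w = rng.normal(size=64)             # toy model weights
x_adv = pgd_linf(x, w)
```

Against a real VLM the same loop runs over the model's multi-modal loss via autograd, but the structure — step, sign, project — is identical.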
Examples:
- Misclassification: Applying a standard $\ell_\infty$ PGD perturbation ($\epsilon=2/255$) to an image of a t-shirt forces the VLM to confidently output "jeans".
- Targeted Hallucination: Injecting transfer-based adversarial noise into MS COCO images causes the VLM to output captions that actively hallucinate non-existent objects, misstate object counts, or confuse scene types, despite clear visual evidence to the contrary.
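To see why an $\epsilon = 2/255$ budget is called human-imperceptible, consider the worst case where every pixel shifts by the full 2 levels of an 8-bit image. A quick back-of-the-envelope PSNR bound (the ~40 dB "visually lossless" threshold used in the comment is a common rule of thumb, not a claim from this advisory):

```python
import math

def psnr_floor(eps_levels=2, max_val=255):
    """Worst-case PSNR when every pixel moves the full eps, in 8-bit levels."""
    # RMSE equals eps_levels when all pixels shift by exactly eps_levels.
    return 20 * math.log10(max_val / eps_levels)

# ~42 dB: above the ~40 dB range commonly treated as visually lossless,
# yet (per the examples above) sufficient to flip the VLM's output.
print(psnr_floor())
```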
Impact: An attacker can deterministically subvert the integrity of VLM outputs without altering the textual prompt. When deployed in security-critical or automated environments—such as medical imaging diagnosis, autonomous driving, or robotic control—this can lead to severe misclassifications, corrupted reasoning chains, and dangerous downstream actions.
Affected Systems: General-purpose Vision-Language Models, including:
- LLaVA-1.5 (7B and 13B)
- DeepSeek-VL-1.3B
- InternVL3-2B
- Ovis2-4B
- Commercial VLM APIs susceptible to transfer attacks (e.g., GPT-4V/GPT-5, Claude, Gemini)
Mitigation Steps: Implement test-time defense mechanisms that exploit text-side redundancy and apply image-side denoising, such as the Paraphrase-Decomposition-Aggregation (PDA) framework:
- Paraphrase: Query the VLM using multiple semantically or logically equivalent variations of the original text prompt. Adversarial image features are highly sensitive to exact text formulations; paraphrasing forces the model to evaluate the image from shifted semantic angles.
- Decomposition: Break complex, multi-hop queries down into a sequence of atomic, verifiable sub-questions targeting isolated visual facts (e.g., object presence, color, spatial relation).
- Aggregation: Use a secondary LLM to aggregate the answers to the atomic sub-questions via agreement-aware majority voting, discarding sporadic hallucinated responses.
- Image Pre-processing: Apply lightweight image transformations, such as JPEG compression or Random Augmentation (RA), prior to inference to wash out high-frequency adversarial noise patterns.
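The aggregation step above can be sketched as a thresholded majority vote. This is a minimal illustration, not the framework's reference implementation: the sub-question, the candidate answers, and the 0.5 agreement threshold are all assumed for the example, and the secondary LLM is replaced by simple string voting.

```python
from collections import Counter

def aggregate_answers(answers_per_paraphrase, min_agreement=0.5):
    """Agreement-aware majority vote over answers to one atomic sub-question.

    Each entry is the VLM's answer to the same sub-question asked under a
    different paraphrase of the prompt. Sporadic hallucinations rarely repeat
    across paraphrases, so a thresholded majority filters them out.
    Returns None (abstain) when no answer reaches the agreement threshold.
    """
    votes = Counter(a.strip().lower() for a in answers_per_paraphrase)
    answer, count = votes.most_common(1)[0]
    if count / len(answers_per_paraphrase) >= min_agreement:
        return answer
    return None

# Hypothetical answers to "Is the subject wearing a t-shirt?" asked under
# five paraphrases; one adversarially induced "jeans" answer is outvoted:
consensus = aggregate_answers(["yes", "Yes", "jeans", "yes", "yes"])

# With no consensus, the defense abstains rather than trust any one reading:
no_consensus = aggregate_answers(["yes", "no", "jeans", "red"])
```

In the full framework this vote runs per atomic sub-question, and the aggregated facts are recomposed into the final answer; abstention can trigger the image pre-processing fallback.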
© 2026 Promptfoo. All rights reserved.