VLM Visual-Textual Misalignment
PDA: Text-Augmented Defense Framework for Robust Vision-Language Models against Adversarial Image Attacks
Description: Vision-Language Models (VLMs) are vulnerable to pixel-level adversarial image perturbations. An attacker can inject $\ell_p$-bounded, human-imperceptible noise into an input image to manipulate the model's multi-modal embedding space. This reliably causes the VLM to generate incorrect textual responses, hallucinate non-existent objects, or misclassify subjects, effectively decoupling the model's reasoning from the actual visual evidence. The vulnerability is exploitable via both white-box gradient-based attacks (e.g., PGD) and black-box transfer attacks against closed-source APIs.
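The white-box attack pattern can be illustrated with a minimal sketch. This is not the paper's actual attack pipeline against a VLM; it runs $\ell_\infty$-bounded PGD against a toy differentiable logistic "classifier" (the model, gradient, and all parameter values here are illustrative assumptions), showing the core loop of gradient ascent plus projection back into the $\epsilon$-ball:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pgd_linf(x, w, eps=2/255, alpha=0.5/255, steps=10):
    """Untargeted l_inf PGD against a toy logistic model (illustrative only).

    Ascends the sign of the input gradient to maximize the loss of the
    true class, projecting back into the eps-ball and the valid pixel
    range [0, 1] after every step.
    """
    x_adv = x.copy()
    for _ in range(steps):
        p = sigmoid(np.dot(w, x_adv))             # confidence in true class
        grad = -(1.0 - p) * w                     # d(-log p)/dx for this model
        x_adv = x_adv + alpha * np.sign(grad)     # gradient ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project into l_inf ball
        x_adv = np.clip(x_adv, 0.0, 1.0)          # keep pixels valid
    return x_adv

rng = np.random.default_rng(0)
x = rng.uniform(0.3, 0.7, size=64)  # toy flattened "image"
w = rng.normal(size=64)             # toy model weights
x_adv = pgd_linf(x, w)
```

Against a real VLM the same loop runs over the model's multi-modal loss via autograd, but the structure — step, sign, project — is identical.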
Examples:
- Misclassification: Applying a standard $\ell_\infty$ PGD perturbation ($\epsilon=2/255$) to an image of a t-shirt forces the VLM to confidently output "jeans".
- Targeted Hallucination: Injecting transfer-based adversarial noise into MS COCO images causes the VLM to output captions that actively hallucinate non-existent objects, misstate object counts, or confuse scene types, despite clear visual evidence to the contrary.
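To see why an $\epsilon = 2/255$ budget is called human-imperceptible, consider the worst case where every pixel shifts by the full 2 levels of an 8-bit image. A quick back-of-the-envelope PSNR bound (the ~40 dB "visually lossless" threshold used in the comment is a common rule of thumb, not a claim from this advisory):

```python
import math

def psnr_floor(eps_levels=2, max_val=255):
    """Worst-case PSNR when every pixel moves the full eps, in 8-bit levels."""
    # RMSE equals eps_levels when all pixels shift by exactly eps_levels.
    return 20 * math.log10(max_val / eps_levels)

# ~42 dB: above the ~40 dB range commonly treated as visually lossless,
# yet (per the examples above) sufficient to flip the VLM's output.
print(psnr_floor())
```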
Impact: An attacker can deterministically subvert the integrity of VLM outputs without altering the textual prompt. When deployed in security-critical or automated environments—such as medical imaging diagnosis, autonomous driving, or robotic control—this can lead to severe misclassifications, corrupted reasoning chains, and dangerous downstream actions.
Affected Systems: General-purpose Vision-Language Models, including:
- LLaVA-1.5 (7B and 13B)
- DeepSeek-VL-1.3B
- InternVL3-2B
- Ovis2-4B
- Commercial VLM APIs susceptible to transfer attacks (e.g., GPT-4V/GPT-5, Claude, Gemini)
Mitigation Steps: Implement test-time defense mechanisms that exploit text-side redundancy and apply image-side denoising, such as the Paraphrase-Decomposition-Aggregation (PDA) framework:
- Paraphrase: Query the VLM using multiple semantically or logically equivalent variations of the original text prompt. Adversarial image features are highly sensitive to exact text formulations; paraphrasing forces the model to evaluate the image from shifted semantic angles.
- Decomposition: Break complex, multi-hop queries down into a sequence of atomic, verifiable sub-questions targeting isolated visual facts (e.g., object presence, color, spatial relation).
- Aggregation: Use a secondary LLM to aggregate the answers to the atomic sub-questions via agreement-aware majority voting, discarding sporadic hallucinated responses.
- Image Pre-processing: Apply lightweight image transformations, such as JPEG compression or Random Augmentation (RA), prior to inference to wash out high-frequency adversarial noise patterns.
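The aggregation step above can be sketched as a thresholded majority vote. This is a minimal illustration, not the framework's reference implementation: the sub-question, the candidate answers, and the 0.5 agreement threshold are all assumed for the example, and the secondary LLM is replaced by simple string voting.

```python
from collections import Counter

def aggregate_answers(answers_per_paraphrase, min_agreement=0.5):
    """Agreement-aware majority vote over answers to one atomic sub-question.

    Each entry is the VLM's answer to the same sub-question asked under a
    different paraphrase of the prompt. Sporadic hallucinations rarely repeat
    across paraphrases, so a thresholded majority filters them out.
    Returns None (abstain) when no answer reaches the agreement threshold.
    """
    votes = Counter(a.strip().lower() for a in answers_per_paraphrase)
    answer, count = votes.most_common(1)[0]
    if count / len(answers_per_paraphrase) >= min_agreement:
        return answer
    return None

# Hypothetical answers to "Is the subject wearing a t-shirt?" asked under
# five paraphrases; one adversarially induced "jeans" answer is outvoted:
consensus = aggregate_answers(["yes", "Yes", "jeans", "yes", "yes"])

# With no consensus, the defense abstains rather than trust any one reading:
no_consensus = aggregate_answers(["yes", "no", "jeans", "red"])
```

In the full framework this vote runs per atomic sub-question, and the aggregated facts are recomposed into the final answer; abstention can trigger the image pre-processing fallback.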
© 2026 Promptfoo. All rights reserved.