X-Shaped Sparse VLM Attack
Research Paper
XSPA: Crafting Imperceptible X-Shaped Sparse Adversarial Perturbations for Transferable Attacks on VLMs
Description: A vulnerability in Vision-Language Models (VLMs) relying on shared visual-textual representation spaces allows attackers to induce transferable cross-task semantic failures using an X-shaped Sparse Pixel Attack (XSPA). Attackers craft imperceptible adversarial perturbations restricted to a fixed geometric prior—two intersecting diagonal lines comprising approximately 1.76% of the image pixels. By jointly optimizing a classification objective with cross-task semantic guidance (target-semantic attraction and source-semantic suppression) and applying magnitude and line-wise smoothness regularization, the perturbation diffuses the model's spatial attention away from semantically decisive object regions. This compromises the visual encoding and propagates errors through the shared semantic space, systematically degrading zero-shot classification, open-ended image captioning, and Visual Question Answering (VQA) simultaneously.
Examples: The attack generates an adversarial image by restricting perturbations strictly to a predefined X-shaped binary mask $M$: $x^{adv} = \operatorname{clip}(x + M \odot \delta, 0, 1)$
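The masked perturbation above can be sketched as follows. This is an illustrative construction, not the paper's reference code: the `x_mask` line-drawing routine and the width-2 diagonals are assumptions chosen so that the support covers roughly 1.76% of a 224×224 image, matching the sparsity the paper reports.

```python
import numpy as np

def x_mask(h, w, width=2):
    """Binary X-shaped mask M: two intersecting diagonals of an h x w image.
    Hypothetical construction; a line width of 2 yields ~1.78% coverage at
    224x224, close to the ~1.76% sparsity reported for XSPA."""
    m = np.zeros((h, w), dtype=np.float32)
    for i in range(h):
        j = round(i * (w - 1) / (h - 1))  # column on the main diagonal
        for d in range(width):
            m[i, min(j + d, w - 1)] = 1.0       # main diagonal
            m[i, max(w - 1 - j - d, 0)] = 1.0   # anti-diagonal
    return m

def apply_xspa(x, delta, mask):
    """x_adv = clip(x + M * delta, 0, 1): perturbation confined to the mask."""
    return np.clip(x + mask * delta, 0.0, 1.0)

mask = x_mask(224, 224)
x = np.full((224, 224), 0.5, dtype=np.float32)      # toy grayscale image
delta = np.random.uniform(-16 / 255, 16 / 255, x.shape).astype(np.float32)
x_adv = apply_xspa(x, delta, mask)
```

Pixels outside the X support are untouched, which is what keeps the attack visually inconspicuous despite an unconstrained color image pipeline downstream.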
The perturbation $\delta$ is optimized on a white-box surrogate model (e.g., CLIP) using a joint objective $\mathcal{L} = \lambda_{clip}\mathcal{L}_{clip} + \lambda_{tar}\mathcal{L}_{tar} + \lambda_{src}\mathcal{L}_{src} + \mathcal{L}_{smooth}$. It utilizes MI-FGSM momentum and DI-FGSM input diversity over 200 iterations, updating only the pixels inside the sparse support before transferring the corrupted image to a black-box VLM.
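The optimization loop can be sketched as a masked MI-FGSM update. This is a minimal sketch, not the paper's implementation: `grad_fn` is a stand-in for backpropagating the joint loss $\mathcal{L}$ through the surrogate encoder, and the step size `alpha`, momentum `mu`, and $\ell_\infty$ budget `eps` are assumed values the source does not specify.

```python
import numpy as np

def masked_mifgsm(x, mask, grad_fn, steps=200, alpha=1 / 255, mu=1.0, eps=16 / 255):
    """Masked MI-FGSM sketch: momentum-accumulated sign updates applied only
    inside the X-shaped support. grad_fn(x_adv) stands in for the gradient of
    the joint loss (clip + target-attraction + source-suppression + smoothness)
    computed on the white-box surrogate; hyperparameters are assumptions."""
    delta = np.zeros_like(x)
    g = np.zeros_like(x)  # momentum buffer (MI-FGSM)
    for _ in range(steps):
        grad = grad_fn(np.clip(x + mask * delta, 0.0, 1.0))
        g = mu * g + grad / (np.abs(grad).mean() + 1e-12)  # normalized momentum
        delta = np.clip(delta + alpha * np.sign(g), -eps, eps)
        delta *= mask  # confine updates to the sparse support
    return np.clip(x + mask * delta, 0.0, 1.0)

# Toy usage with an analytic gradient pushing pixel values upward;
# in the real attack grad_fn would also apply DI-FGSM input diversity
# (random resize/pad) before each forward pass.
x = np.full((8, 8), 0.5, dtype=np.float32)
mask = np.eye(8, dtype=np.float32)
x_adv = masked_mifgsm(x, mask, grad_fn=lambda v: np.ones_like(v), steps=50)
```

The DI-FGSM component is omitted here for brevity; it would randomly transform the input inside `grad_fn` before each gradient evaluation to improve black-box transfer.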
See the original paper's experimental evaluation on the COCO-80 dataset for specific visual instances demonstrating Grad-CAM attention displacement and resulting VQA/captioning semantic drift.
Impact: A single adversarial image can simultaneously disrupt multiple downstream multimodal tasks, causing profound semantic drift and reasoning failures while maintaining high visual naturalness. Zero-shot classification accuracy drops by up to 67.00 points (e.g., OpenCLIP ViT-B/16). For open-ended generative tasks, caption consistency decreases by up to 58.60 points (OpenFlamingo), and VQA correctness decreases by up to 44.38 points (BLIP-2).
Affected Systems: Systems implementing or relying on CLIP-style visual encoders and downstream generative VLMs. Vulnerable models evaluated include:
- Visual Encoders: OpenAI CLIP ViT-L/14, OpenCLIP ViT-B/16, Meta-CLIP ViT-L/14, EVA-CLIP ViT-G/14
- Downstream VLMs: LLaVA-1.5, LLaVA-1.6, OpenFlamingo, BLIP-2 (FlanT5XL ViT-L, FlanT5XL), InstructBLIP (FlanT5XL)
© 2026 Promptfoo. All rights reserved.