LMVD-ID: d005f787
Published December 1, 2025

Adversarial Vision Degrades VLM

Affected Models: GPT-4V, Llama 2 13B, Llama 3.1 8B, Llama 3.2 8B, LLaVA 7B, Vicuna 13B

Research Paper

Adversarial Robustness of Vision in Open Foundation Models


Description: Large Language and Vision Assistant (LLaVA) v1.5-13B and Meta Llama 3.2 Vision-8B-2 are vulnerable to adversarial evasion attacks targeting the visual input modality. An attacker with white-box access (knowledge of model architecture and gradients) can employ Projected Gradient Descent (PGD) to generate adversarial perturbations constrained by an L-infinity norm. By maximizing the model's internal loss function with respect to the input image, the attacker can force the Vision-Language Model (VLM) to generate incorrect, irrelevant, or nonsensical textual outputs for Visual Question Answering (VQA) tasks, effectively bypassing the model's visual understanding capabilities while leaving the text prompt benign.

Examples: To reproduce the vulnerability, an attacker implements an iterative PGD attack. The adversarial image $x^{t+1}$ is generated using the following update rule targeting the model's loss function $J$:

$$x^{t+1} = \Pi_{\mathcal{B}_{\epsilon}(x)} \left( x^t + \alpha \cdot \text{sign}\left( \nabla_{x^t} J(\theta, x^t, q) \right) \right)$$
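In code, this update is a signed-gradient ascent step on the model's loss followed by projection onto the L-infinity ball around the clean image. A minimal PyTorch sketch of one step, assuming a hypothetical `loss_fn` that returns the VLM's loss $J$ for an image tensor and the benign question `q`:

```python
import torch

def pgd_step(x_adv, x_orig, loss_fn, q, epsilon, alpha):
    """One untargeted PGD step: ascend the loss via the gradient sign,
    then project back onto the L-infinity ball B_eps(x_orig)."""
    x_adv = x_adv.clone().detach().requires_grad_(True)
    loss = loss_fn(x_adv, q)                  # J(theta, x^t, q); hypothetical helper
    loss.backward()
    with torch.no_grad():
        x_adv = x_adv + alpha * x_adv.grad.sign()                       # ascent step
        x_adv = torch.clamp(x_adv, x_orig - epsilon, x_orig + epsilon)  # projection onto B_eps(x)
        x_adv = torch.clamp(x_adv, 0.0, 1.0)                            # keep a valid image
    return x_adv.detach()
```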

Configuration parameters utilized to demonstrate the exploit:

  • Attack Algorithm: Untargeted PGD (Projected Gradient Descent).
  • Perturbation Budget ($\epsilon$): Ranges from 2/255 (imperceptible) to 255/255 (maximum).
  • Step Size ($\alpha$): Scaled proportionally to $\epsilon$ (e.g., $\alpha=0.00196$ for $\epsilon=2/255$).
  • Iterations: 5 to 30 steps depending on budget.
  • Target: the visual encoder (CLIP ViT-L/14 for LLaVA, ViT-H/14 for Llama 3.2), attacked via the pgd_utils.py implementation.

See arXiv:2310.03744 (LLaVA) and relevant repositories for standard PGD implementations on VLMs.
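The full attack simply iterates the step above for the listed budgets. Below is a sketch of the outer loop; `pgd_step` and `loss_fn` are the hypothetical helpers from the previous snippet, and the step-size factor of $\epsilon/4$ is an assumption consistent with the example value listed ($\alpha \approx 0.00196$ for $\epsilon = 2/255$):

```python
def run_untargeted_pgd(x, q, loss_fn, epsilon=2/255, num_steps=30):
    """Iterate the PGD update for a fixed L-infinity budget.
    Budgets range from 2/255 (imperceptible) to 255/255 (maximum),
    with 5-30 iterations depending on the budget."""
    alpha = epsilon / 4          # assumed proportionality; ~0.00196 at epsilon = 2/255
    x_adv = x.clone()
    for _ in range(num_steps):
        x_adv = pgd_step(x_adv, x, loss_fn, q, epsilon, alpha)
    return x_adv

# Example usage over the endpoints of the reported budget range:
# for eps in (2/255, 255/255):
#     adv_image = run_untargeted_pgd(clean_image, question, loss_fn, epsilon=eps)
```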

Impact:

  • Integrity Compromise: The model provides incorrect answers to visual queries, hallucinating content or failing to recognize objects present in the image.
  • Accuracy Degradation:
      • LLaVA-1.5-13B: Suffers a VQA accuracy drop of up to 36.0 percentage points under maximum perturbation.
      • Llama 3.2 Vision-8B-2: Suffers a VQA accuracy drop of up to 10.4 percentage points.
  • Downstream Manipulation: Applications relying on these models for image description, automated moderation, or accessibility tools can be deceived into misinterpreting visual context without manipulating the text prompt.

Affected Systems:

  • LLaVA-1.5-13B (Hugging Face identifier: llava-hf/llava-1.5-13b-hf)
  • Meta Llama 3.2 Vision-8B-2 (Hugging Face identifier: meta-llama/Llama-3.2-Vision-8B-2)

Mitigation Steps:

  • Adversarial Training: Fine-tune models on PGD-generated adversarial examples during training to harden the visual encoder and alignment layers.
  • Input Sanitization: Restrict the visual input channel in high-stakes environments, or apply pre-processing steps (e.g., JPEG compression, randomization) to disrupt specific adversarial perturbations, though this may degrade standard performance; a sketch follows this list.
  • Architectural Hardening: Cross-attention adapter mechanisms (as in Llama 3.2) rather than simple projection layers (as in LLaVA) correlate with higher relative stability as perturbation magnitude increases, though they do not eliminate the vulnerability.
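As an illustration of the input-sanitization option, the sketch below re-encodes incoming images as JPEG before they reach the vision encoder. This is a generic example rather than the paper's defense; the quality setting and the resulting clean-accuracy cost are deployment-specific assumptions to validate.

```python
from io import BytesIO
from PIL import Image

def jpeg_sanitize(image: Image.Image, quality: int = 75) -> Image.Image:
    """Lossy re-encoding removes some high-frequency adversarial noise.
    It is not a complete defense: adaptive attackers can optimize through
    it, and aggressive compression degrades benign VQA accuracy."""
    buffer = BytesIO()
    image.convert("RGB").save(buffer, format="JPEG", quality=quality)
    buffer.seek(0)
    return Image.open(buffer).convert("RGB")
```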

© 2026 Promptfoo. All rights reserved.