Universal VLLM Visual Bypass
Research Paper
Transferable adversarial attacks on black-box vision-language models
Description: A vulnerability exists in Vision-Language Models (VLLMs) that allows for transferable, targeted adversarial attacks. Attackers can generate adversarial image perturbations using an ensemble of open-source surrogate models (primarily CLIP-based visual encoders) that transfer effectively to proprietary, black-box VLLMs. The attack leverages an optimization framework built around a Visual Contrastive Loss computed over multiple positive and negative visual examples, rather than relying solely on image-text pairs. Transferability is further amplified through model-level regularization (DropPath, PatchDrop) and data-level augmentation (random Gaussian noise, random cropping, and differentiable JPEG compression) during perturbation generation. This allows an attacker to manipulate the visual input to induce specific, targeted textual responses from the VLLM, independent of the actual image content.
Examples: To reproduce the attack, an attacker must solve an optimization problem to find a perturbation $\delta$ that minimizes a loss function across an ensemble of surrogate models.
- Set Up Surrogate Ensemble: Select a set of open-source visual encoders (e.g., ViT-H, ViT-SigLIP, ConvNeXt XXL, and the LLaVA-NeXT visual components).
- Define Loss Function (Visual Contrastive Loss): Instead of standard cross-entropy, minimize the following loss for each surrogate model: $$ \mathcal{L} = -\frac{1}{K}\sum_{i \in \operatorname{TopK}} \log p(x_i^+) + \frac{1}{N}\sum_{i=1}^{N}\log p(x_i^-) $$ where $x_i^+$ are the $N = 50$ positive image examples aligned with the target semantic (e.g., a "safe" image if the goal is to bypass a filter), $x_i^-$ are negative examples aligned with the original image, and the first sum runs over the $K$ positives with the highest $\log p(x_i^+)$. A code sketch of this loss follows the list below.
- Apply Regularization during Optimization (see the optimization-loop sketch after this list):
- DropPath: Skip residual blocks in the surrogate visual model randomly during the forward pass.
- PatchDrop: Randomly drop 20% of visual patches during optimization for ViT-based surrogates.
- Weight Moving Averaging: Maintain an exponential moving average of the perturbation $\delta$: $\delta_{\mathrm{MA}} \leftarrow 0.99 \cdot \delta_{\mathrm{MA}} + 0.01 \cdot \delta$.
- Apply Data Augmentation:
- Add Gaussian noise: $x \leftarrow x_{\delta} + \frac{\epsilon}{4} \cdot z$, where $x_{\delta}$ is the perturbed image, $z$ is standard Gaussian noise, and $\epsilon$ is the perturbation budget.
- Apply differentiable JPEG compression with quality sampled uniformly from $[0.5, 1.0]$.
- Apply Random Resized Crop and Pad.
- Execution:
- Object Misclassification: Input a hazardous image and optimize $\delta$ toward the embedding of a benign object; the VLLM then classifies the hazardous image as benign.
- Text Recognition Manipulation: Input an image of a receipt and optimize $\delta$ targeting a specific incorrect string (e.g., an altered total value); the VLLM's OCR output then reads the attacker-defined value.
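The following is a minimal PyTorch sketch of the Visual Contrastive Loss step described above. It assumes embeddings are compared by cosine similarity and uses temperature-scaled similarity as a stand-in for $\log p(\cdot)$; the function name, the `temperature` parameter, and the default $K = 10$ are illustrative assumptions, not values confirmed by the paper.

```python
import torch
import torch.nn.functional as F

def visual_contrastive_loss(adv_emb: torch.Tensor,
                            pos_embs: torch.Tensor,
                            neg_embs: torch.Tensor,
                            k: int = 10,
                            temperature: float = 0.1) -> torch.Tensor:
    """Sketch of the Visual Contrastive Loss for one surrogate encoder.

    adv_emb:  (D,)    embedding of the perturbed image
    pos_embs: (N, D)  embeddings of positives aligned with the target semantic
    neg_embs: (N, D)  embeddings of negatives aligned with the original image
    """
    # Temperature-scaled cosine similarity serves as a stand-in for log p(x_i)
    # (assumption; the paper's exact probability model is not specified here).
    pos_logp = F.cosine_similarity(adv_emb.unsqueeze(0), pos_embs, dim=-1) / temperature
    neg_logp = F.cosine_similarity(adv_emb.unsqueeze(0), neg_embs, dim=-1) / temperature

    # Pull toward the K positives with the highest current score (the TopK term)
    # and push away from the average over all negatives.
    topk_pos = pos_logp.topk(min(k, pos_logp.numel())).values
    return -topk_pos.mean() + neg_logp.mean()
```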
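Building on that loss, the optimization-loop sketch below shows how the surrogate ensemble, the data-level augmentations, and the weight moving average could fit together. It assumes `encoders` is a list of differentiable image-to-embedding modules (e.g., CLIP visual towers with DropPath/PatchDrop enabled internally), substitutes torchvision's `RandomResizedCrop` for the paper's exact crop-and-pad, and omits differentiable JPEG for brevity; the step count, learning rate, optimizer, and $\epsilon$ budget are illustrative.

```python
import torch
import torchvision.transforms as T

def generate_perturbation(encoders, image, pos_images, neg_images,
                          eps=16 / 255, steps=300, lr=1e-2, k=10):
    """Sketch of the ensemble attack loop (hyperparameters are illustrative).

    encoders:   list of surrogate visual encoders, (B, 3, H, W) -> (B, D)
    image:      (1, 3, H, W) original image, values in [0, 1]
    pos_images: (N, 3, H, W) positives aligned with the target semantic
    neg_images: (N, 3, H, W) negatives aligned with the original image
    """
    crop = T.RandomResizedCrop(image.shape[-2:], scale=(0.7, 1.0), antialias=True)

    # Positive/negative embeddings per surrogate are fixed; no gradients needed.
    with torch.no_grad():
        pos_embs = [enc(pos_images) for enc in encoders]
        neg_embs = [enc(neg_images) for enc in encoders]

    delta = torch.zeros_like(image, requires_grad=True)
    delta_ma = torch.zeros_like(image)  # weight moving average of the perturbation
    opt = torch.optim.Adam([delta], lr=lr)

    for _ in range(steps):
        x_adv = (image + delta).clamp(0, 1)

        # Data-level augmentation: Gaussian noise and a random resized crop.
        # (Differentiable JPEG compression would also be applied here.)
        x_aug = x_adv + (eps / 4) * torch.randn_like(x_adv)
        x_aug = crop(x_aug)

        # Sum the Visual Contrastive Loss over the surrogate ensemble.
        # DropPath / PatchDrop are assumed to be active inside each encoder.
        loss = sum(
            visual_contrastive_loss(enc(x_aug).squeeze(0), pe, ne, k=k)
            for enc, pe, ne in zip(encoders, pos_embs, neg_embs)
        )

        opt.zero_grad()
        loss.backward()
        opt.step()

        with torch.no_grad():
            delta.clamp_(-eps, eps)                      # L-infinity budget
            delta_ma.mul_(0.99).add_(delta, alpha=0.01)  # delta_MA <- 0.99*delta_MA + 0.01*delta

    return delta_ma.detach()
```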
Impact:
- Safety Guardrail Bypass: Malicious actors can bypass visual safety filters by disguising hazardous content (e.g., pornography, violence) as benign objects, causing the model to process and describe content it is aligned to refuse.
- Integrity Violation: Attackers can manipulate the interpretation of documents (e.g., receipts, invoices, medical scans) causing the VLLM to extract incorrect, attacker-chosen text or values, facilitating fraud.
- Universal Misinterpretation: A single perturbation pattern can generalize across different images and different proprietary models, causing consistent misinterpretation.
Affected Systems:
- Proprietary Models: GPT-4o (OpenAI), Claude 3.5 Sonnet (Anthropic), Gemini 1.5 Pro (Google).
- Open Source Models: Llama 3.2 Vision, Qwen2.5-VL, LLaVA-NeXT.
- Underlying Architectures: Any VLLM utilizing standard visual encoders such as CLIP (ViT, ResNet) or SigLIP for visual feature extraction.
Mitigation Steps:
- Adversarial Training: Incorporate adversarially perturbed images into the pre-training or fine-tuning datasets of the visual encoders (e.g., using techniques similar to AdvXL or TeCoA4) to increase robustness against gradient-based attacks.
- Input Transformation: Implement non-differentiable or randomized input transformations (such as aggressive JPEG compression, random resizing, or diffusion-based purification) at the API level before the image is processed by the model to disrupt the specific structure of the adversarial perturbation (a minimal sketch follows this list).
- Ensemble Defenses: Utilize a diverse set of visual encoders with different architectures and pre-training objectives during inference to verify consistency in visual interpretation (see the consistency-check sketch after this list).
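As a concrete illustration of the input-transformation mitigation, the sketch below (a hypothetical `purify_image` helper) re-encodes each incoming image with aggressive, non-differentiable JPEG compression at a random quality and applies a random resize before the image reaches the model. Parameter ranges are illustrative, and a stronger defense such as diffusion-based purification could replace the JPEG step.

```python
import io
import random
from PIL import Image

def purify_image(img: Image.Image,
                 quality_range=(40, 70),
                 scale_range=(0.8, 1.2)) -> Image.Image:
    """Randomized, non-differentiable preprocessing applied before VLLM inference."""
    # 1. A random resize breaks the pixel-exact alignment the perturbation relies on.
    scale = random.uniform(*scale_range)
    w, h = img.size
    img = img.convert("RGB").resize((max(1, int(w * scale)), max(1, int(h * scale))))

    # 2. Aggressive JPEG re-encoding at a random quality destroys the high-frequency
    #    structure of the adversarial noise.
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=random.randint(*quality_range))
    buf.seek(0)
    return Image.open(buf).convert("RGB")
```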
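For the ensemble-defense idea, a minimal consistency check might compare the predictions of architecturally different models on the same input and flag disagreement for human review. The sketch below uses two off-the-shelf ImageNet classifiers purely as stand-ins for the diverse visual encoders a production system would deploy; the model choice and the top-1 agreement criterion are assumptions.

```python
import torch
from torchvision import models, transforms

# Preprocessing shared by both reference models (ImageNet normalization).
_preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Two architecturally different reference models; a production deployment would
# swap in the same visual encoders its VLLMs actually use.
_reference_models = [
    models.resnet50(weights="IMAGENET1K_V2").eval(),
    models.vit_b_16(weights="IMAGENET1K_V1").eval(),
]

def predictions_agree(pil_image) -> bool:
    """Return False (flag for review) when the reference models disagree on top-1."""
    x = _preprocess(pil_image).unsqueeze(0)
    with torch.no_grad():
        top1 = [m(x).argmax(dim=1).item() for m in _reference_models]
    return len(set(top1)) == 1
```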