Enhanced LVLM Encoder Transfer
Research Paper
Understanding and Enhancing Encoder-based Adversarial Transferability against Large Vision-Language Models
Description: Large Vision-Language Models (LVLMs) are vulnerable to zero-query, black-box adversarial image perturbations via Semantic-Guided Multimodal Attacks (SGMA). Unlike traditional attacks that scatter noise or target background pixels, SGMA leverages surrogate models (e.g., CLIP) to anchor imperceptible adversarial perturbations directly onto semantically critical foreground regions. The attack exploits two architectural traits of LVLMs: inconsistent visual grounding across models and redundant semantic alignment within models. By jointly optimizing global image-text misalignment and local region-phrase disruption, the perturbations transfer effectively across heterogeneous vision encoders and language modules. This causes the victim LVLM to fail at cross-modal grounding, producing confident misclassifications, incorrect image captions, and fabricated visual reasoning.
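To make the global misalignment component concrete, the sketch below runs a PGD-style attack against a CLIP surrogate via Hugging Face `transformers`. This is a simplified, hypothetical reconstruction, not the paper's algorithm: it implements only the global image-text misalignment term under the $\epsilon=8/255$ $L_\infty$ budget used in the example that follows, omits SGMA's local region-phrase disruption over foreground regions, and the function name `pgd_misalign` and all hyperparameters are illustrative.

```python
# Hypothetical sketch: PGD on a CLIP surrogate, maximizing global
# image-text misalignment only (SGMA's local region-phrase term omitted).
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# CLIP's input normalization constants, applied manually so the L_inf
# budget is enforced in raw [0, 1] pixel space.
MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073]).view(1, 3, 1, 1)
STD = torch.tensor([0.26862954, 0.26130258, 0.27577711]).view(1, 3, 1, 1)

def pgd_misalign(image: torch.Tensor, caption: str,
                 eps: float = 8 / 255, alpha: float = 1 / 255,
                 steps: int = 40) -> torch.Tensor:
    """PGD that pushes CLIP image features away from the clean caption.

    `image` is a (1, 3, 224, 224) tensor in [0, 1].
    """
    text = processor(text=[caption], return_tensors="pt", padding=True)
    with torch.no_grad():
        t_feat = model.get_text_features(**text)
        t_feat = t_feat / t_feat.norm(dim=-1, keepdim=True)
    delta = torch.zeros_like(image).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        x = (image + delta).clamp(0, 1)
        i_feat = model.get_image_features(pixel_values=(x - MEAN) / STD)
        i_feat = i_feat / i_feat.norm(dim=-1, keepdim=True)
        # Minimizing cosine similarity maximizes image-text misalignment.
        loss = (i_feat * t_feat).sum()
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()   # gradient-descent step
            delta.clamp_(-eps, eps)              # project to the L_inf ball
            delta.grad = None
    return (image + delta.detach()).clamp(0, 1)
```

Enforcing the budget in raw pixel space, before CLIP normalization, keeps the perturbation directly comparable to the $8/255$ constraint cited below.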
Examples: In a medical diagnostic scenario using chest X-ray images, an attacker applies an SGMA perturbation under an $L_\infty$ budget of $\epsilon=8/255$.
- Clean Image Output (GPT-4.1): "DIAGNOSIS: Lung Opacity, CONFIDENCE: High", accompanied by accurate findings describing multiple abnormal opacities and bilateral consolidations.
- Adversarial Image Output (GPT-4.1): "DIAGNOSIS: COVID-19, CONFIDENCE: High", accompanied by completely fabricated radiological findings, such as "bilateral, patchy ground-glass opacities predominantly in the peripheral and lower lung zones."
Impact: This vulnerability compromises the integrity of visual perception in LVLMs. It allows attackers to covertly manipulate the model's textual output without raising visual suspicion. In safety-critical deployments, this can lead to severe consequences, such as forcing automated medical triage systems into generating confident misdiagnoses or tricking autonomous driving perception systems into misinterpreting traffic signs and critical obstacles.
Affected Systems: The vulnerability applies broadly across LVLMs utilizing varied vision encoders and language backbones. Successfully compromised systems evaluated in the study include:
- Open-source LVLMs: LLaVA, OpenFlamingo, BLIP-2, Qwen2.5-VL, InternVL3, and Kimi-VL.
- Commercial closed-source APIs: GPT-4o and Gemini 2.0 Flash.
Mitigation Steps:
- Diffusion-Based Purification (DiffPure): Implement diffusion-based models as an input-level preprocessing step to purify images before they reach the vision encoder. This was the most effective defense evaluated, significantly reducing the attack success rate; a minimal sketch appears after this list.
- Image Compression and Neural Purification: Apply lossy transformations such as JPEG compression or Neural Representation Purification (NRP) to incoming images to disrupt the continuous adversarial noise patterns (see the JPEG sketch after this list).
- Adversarial Training: Incorporate adversarial training specifically targeting the vision encoder to build robustness against dense, patch-level semantic disruptions, forcing the model to rely on more robust visual-language priors.
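A hypothetical sketch of diffusion purification in the spirit of DiffPure follows, using the `diffusers` library with a small unconditional DDPM (`google/ddpm-cifar10-32`, chosen only because it is publicly available) and default DDPM scheduler settings. The diffusion depth `t_star` is an untuned assumption that trades noise removal against content distortion.

```python
# Hypothetical DiffPure-style purification sketch using `diffusers`.
import torch
from diffusers import DDPMScheduler, UNet2DModel

def diffpure(x: torch.Tensor, unet: UNet2DModel,
             scheduler: DDPMScheduler, t_star: int = 100) -> torch.Tensor:
    """Purify a batch of images in [-1, 1]: diffuse forward to timestep
    t_star, then run the learned reverse process back to t = 0."""
    # Forward diffusion: drown the adversarial perturbation in Gaussian
    # noise up to timestep t_star.
    t = torch.full((x.shape[0],), t_star, dtype=torch.long)
    x_t = scheduler.add_noise(x, torch.randn_like(x), t)
    # Reverse diffusion: denoise step by step back to a clean image.
    for step in range(t_star, -1, -1):
        with torch.no_grad():
            eps = unet(x_t, step).sample
        x_t = scheduler.step(eps, step, x_t).prev_sample
    return x_t.clamp(-1, 1)

unet = UNet2DModel.from_pretrained("google/ddpm-cifar10-32").eval()
scheduler = DDPMScheduler()  # defaults match the original DDPM schedule
purified = diffpure(torch.rand(1, 3, 32, 32) * 2 - 1, unet, scheduler)
```

In a real deployment the diffusion model would be matched to the input distribution (e.g., natural or radiological images) rather than CIFAR-10.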
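For the compression defense, a minimal JPEG round-trip preprocessor is sketched below, assuming Pillow. The quality setting is illustrative and untuned; re-encoding attenuates high-frequency adversarial noise at some cost in benign fidelity.

```python
# Minimal JPEG-purification preprocessor (sketch; assumes Pillow).
import io
from PIL import Image

def jpeg_purify(image: Image.Image, quality: int = 75) -> Image.Image:
    """Round-trip an image through lossy JPEG encoding to disrupt
    continuous adversarial noise patterns."""
    buf = io.BytesIO()
    image.convert("RGB").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")
```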