LMVD-ID: e3bf0c03
Published February 1, 2026

Progressive Attention LVLM Attack

Affected Models: GPT-4o, GPT-5, Gemini 2, Qwen 2.5 30B, Gemma 4B, LLaVA 7B

Research Paper

When and Where to Attack? Stage-wise Attention-Guided Adversarial Attack on Large Vision Language Models

View Paper

Description: Large Vision-Language Models (LVLMs) are vulnerable to the Stage-wise Attention-Guided Attack (SAGA), which generates highly transferable yet imperceptible adversarial examples. The vulnerability stems from a positive correlation between regional cross-modal attention scores and adversarial loss sensitivity in LVLMs. An attacker can exploit this by extracting an attention map from a surrogate open-source model (e.g., Qwen3-VL) to identify high-attention "hotspots." SAGA uses a stage-wise optimization schedule that allocates the $L_\infty$ perturbation budget to these hotspots first, then progressively targets subsequent salient regions as the model's attention redistributes during the attack. This method bypasses visual encoders and aligns the image representation with a malicious target text embedding, causing the target LVLM to output attacker-defined captions or answers while the input image remains visually benign to humans.
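The stage-wise budget allocation described above can be sketched as follows. This is a minimal NumPy illustration, not the official SAGA implementation: `grad_fn`, the stage count, and the step sizes are assumptions made for the sketch, and a real attack would compute gradients of an image-embedding/target-text-embedding alignment loss through the surrogate model.

```python
import numpy as np


def stagewise_hotspot_masks(attn, n_stages=3):
    """Split pixels into stages by descending attention score.

    attn: (H, W) cross-modal attention map from a surrogate model.
    Returns a list of boolean masks; stage 0 covers only the
    highest-attention "hotspots", and each later stage adds the
    next-most-salient region, so the masks grow cumulatively.
    """
    order = np.argsort(attn, axis=None)[::-1]  # flat pixel indices, high attention first
    per_stage = int(np.ceil(order.size / n_stages))
    masks = []
    for s in range(n_stages):
        m = np.zeros(attn.shape, dtype=bool)
        m.flat[order[: (s + 1) * per_stage]] = True
        masks.append(m)
    return masks


def stagewise_attack(image, grad_fn, attn, eps=8 / 255, steps_per_stage=10, alpha=1 / 255):
    """Minimal stage-wise L_inf attack sketch (hypothetical parameters).

    grad_fn(image) must return the gradient of the adversarial loss
    w.r.t. the image. Updates are restricted to the current stage's
    mask, so the perturbation budget is spent on hotspots first.
    """
    adv = image.copy()
    for mask in stagewise_hotspot_masks(attn):
        for _ in range(steps_per_stage):
            g = grad_fn(adv)
            step = alpha * np.sign(g) * mask[..., None]          # perturb only salient pixels
            adv = np.clip(adv - step, image - eps, image + eps)  # project to L_inf ball
            adv = np.clip(adv, 0.0, 1.0)                         # keep a valid image
    return adv
```

The cumulative masks mirror SAGA's schedule: early iterations concentrate the budget where the surrogate attends most, and later stages widen the perturbed region as attention redistributes.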

Examples:

  • Target: Cause various commercial LVLMs to hallucinate specific animals in a landscape image.
  • Input: A source image containing only rocks and trees.
  • Target Text: "Two giraffes stand near a beige building, surrounded by rocks and lush green trees in a zoo-like enclosure."
  • Attack Execution: Using attention maps extracted from Qwen3-VL-8B, perturbations are applied to the source image.
  • Result: When the adversarial image is fed to Gemini-3 Pro, GPT-5.2 Pro, and Grok-4.1 Thinking with the prompt "Describe this image," the models output descriptions confirming the presence of giraffes.
  • See repository: https://github.com/jackwaky/SAGA
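For illustration, the surrogate attention-extraction step in the example above can be approximated with a toy stand-in. A real attack reads Qwen3-VL's own cross-attention weights; the embedding shapes and the scaled dot-product scoring below are assumptions for this sketch.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()


def cross_modal_attention_map(patch_embeds, text_embed, grid):
    """Toy patch-to-text attention map (illustrative only).

    patch_embeds: (N, D) image-patch embeddings, N = grid[0] * grid[1]
    text_embed:   (D,) pooled embedding of the prompt or target text
    Returns an (H, W) map of normalized attention scores whose peaks
    mark the "hotspots" a stage-wise attack would perturb first.
    """
    scores = patch_embeds @ text_embed / np.sqrt(patch_embeds.shape[1])
    return softmax(scores).reshape(grid)
```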

Impact:

  • Integrity Violation: Attackers can manipulate model outputs in Image Captioning and Visual Question Answering (VQA) tasks, forcing the model to hallucinate objects, misinterpret scenes, or output specific target phrases.
  • Black-Box Transferability: Adversarial images generated using open-source proxies are effective against closed-source, API-based models (e.g., Gemini, GPT-4/5) without requiring access to the target model's gradients or weights.
  • Safety Bypass: Potential to bypass visual safety filters or guardrails by aligning visual embeddings with benign-looking but semantically malicious target texts.

Affected Systems:

  • Closed-Source Models: Gemini-2.5-Flash, Gemini-3-Pro-Preview, GPT-4.1, GPT-5 Mini, Grok 4 Fast.
  • Open-Source Models: LLaVA-1.5-7B, Gemma-3-4B, Llama-4-Maverick, Qwen3-VL-30B-A3B-Instruct, Qwen3-VL-235B-A22B-Instruct.

Mitigation Steps:

  • Develop LVLM architectures and training objectives that explicitly minimize the correlation between regional attention scores and adversarial loss sensitivity.
  • Implement adversarial training procedures that include stage-wise, attention-guided adversarial examples in the training dataset to improve robustness against localized perturbations.
  • Note: This is a systemic vulnerability in current cross-modal attention mechanisms; no simple patch exists.

© 2026 Promptfoo. All rights reserved.