Visual Compression Mismatch Attack
Research Paper
On the Adversarial Robustness of Large Vision-Language Models under Visual Token Compression
Description: A vulnerability exists in Large Vision-Language Models (LVLMs) that use visual token compression mechanisms (e.g., VisionZip, VisPruner) to reduce inference latency. It stems from an optimization-inference mismatch: standard adversarial defenses assume full-token processing, while the deployed model operates on a subset of tokens selected via importance metrics (typically attention scores).
Attackers can exploit this via the Compression-AliGnEd (CAGE) attack, which optimizes perturbations to maximize Expected Feature Disruption (EFD) and to enforce Rank Distortion Alignment (RDA). The technique manipulates the token-selection logic so that heavily perturbed tokens ("survivors") are prioritized while cleaner, low-importance tokens are suppressed. The compression mechanism thus acts as a "distortion concentrator": the downstream LLM processes a visual representation dominated by adversarial artifacts, largely bypassing the implicit denoising effect of pruning.
Examples: To reproduce the attack on a victim model (e.g., LLaVA-1.5 with VisionZip compression):
- Define the Survivor Probability: Model the unknown deployment token budget $K_{model}$ as a random variable and compute the survival probability $\pi_i(\mathbf{x})$ of each visual token from its rank $r_i(\mathbf{x})$ in the attention map: $$ \pi_i(\mathbf{x}) = \sum_k P(K_{model}=k) \cdot \mathbb{I}(r_i(\mathbf{x}) < k) $$
- Compute Expected Feature Disruption (EFD): Calculate the loss that concentrates distortion on probable survivors: $$ \mathcal{L}_{EFD}(\mathbf{x}) = \frac{\sum_{i=1}^{N} \pi_i(\mathbf{x}) \cdot (1 - \mathcal{S}(\mathbf{z}_i^{adv}, \mathbf{z}_i^{cln}))}{\sum_{i=1}^{N} \pi_i(\mathbf{x})} $$ (where $\mathcal{S}$ is cosine similarity).
- Compute Rank Distortion Alignment (RDA): Differentiably align the attention-score distribution $p^{(s)}$ with the token-distortion distribution $p^{(d)}$: $$ \mathcal{L}_{RDA}(\mathbf{x}) = \sum_{i=1}^{N} p_i^{(d)}(\mathbf{x}) \log p_i^{(s)}(\mathbf{x}) $$
- Optimization: Update the input image $\mathbf{x}$ with PGD to maximize $\mathcal{L}_{total} = \mathcal{L}_{EFD} + \lambda \cdot \mathcal{L}_{RDA}$.
See the mathematical formulation in Section 4 of the referenced paper.
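The loss terms above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function names and the budget-distribution format are illustrative, and a real attack would compute these losses in a differentiable framework (e.g., PyTorch) so that PGD can backpropagate through them.

```python
import numpy as np

def survivor_probability(ranks, budget_probs):
    """pi_i = sum_k P(K=k) * 1[r_i < k].

    ranks: (N,) int array, attention rank of each token (0 = most attended).
    budget_probs: dict {k: P(K_model = k)} modelling the unknown
    deployment token budget K_model as a random variable.
    """
    pi = np.zeros(len(ranks))
    for k, p in budget_probs.items():
        pi += p * (ranks < k)
    return pi

def efd_loss(pi, z_adv, z_cln):
    """Expected Feature Disruption: survival-weighted cosine dissimilarity
    between adversarial and clean token features, (N, d) arrays."""
    cos = np.sum(z_adv * z_cln, axis=1) / (
        np.linalg.norm(z_adv, axis=1) * np.linalg.norm(z_cln, axis=1))
    return np.sum(pi * (1.0 - cos)) / np.sum(pi)

def rda_loss(p_dist, p_score):
    """Rank Distortion Alignment: align the attention-score distribution
    p^(s) with the token-distortion distribution p^(d)."""
    return np.sum(p_dist * np.log(p_score + 1e-12))
```

For example, with a budget that is 2 or 4 tokens with equal probability, the top-ranked token survives with probability 1.0 and the fourth-ranked token with probability 0.5.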
Impact:
- Adversarial Robustness Degradation: Significant reduction in robust accuracy on visual question answering (VQA) and reasoning tasks compared to standard encoder-based attacks. For example, on TextVQA, robust accuracy can drop by >10% relative to baselines under tight token budgets.
- Targeted Misinformation: The model is forced to rely on corrupted visual evidence, leading to hallucinations or incorrect text generation in safety-critical, latency-sensitive applications (e.g., mobile agents, autonomous driving).
- Defense Bypass: Bypasses "security through obscurity" regarding the specific token budget ($K$) used in deployment.
Affected Systems:
- LVLM Architectures: LLaVA (v1.5, NeXT), Qwen2.5-VL, and derivatives.
- Compression Mechanisms (Outer-LLM):
- VisionZip
- VisPruner
- DivPrune
- FlowCut
- PruMerge
Mitigation Steps:
- Robustness-Aware Selection (D1): Modify the token selection mechanism to penalize tokens with unstable attention scores. Rank tokens by $s_i(x) = \mathbb{E}_m[a_i(x+\delta_m)] - \beta \cdot \mathrm{Std}_m[a_i(x+\delta_m)]$, where $\delta_m$ is random noise.
- Stochastic Candidate Pool (D2): Introduce randomness into the bottleneck to break the deterministic predictability of survivors. Select the top-$(K+\Delta)$ tokens and uniformly sample $K$ survivors from this pool.
- Attention-Based Detection: Implement a lightweight detector that monitors the "peakiness" of the attention distribution. Adversarial attacks tend to disperse attention; a threshold on the cumulative attention mass of the Top-K tokens can identify malicious inputs.
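The three mitigations above can be sketched as follows. This is a hedged NumPy illustration: `attn_fn`, the noise scale `sigma`, and all function names are hypothetical placeholders rather than a real LVLM API.

```python
import numpy as np

def robust_scores(attn_fn, x, beta=1.0, m_samples=8, sigma=0.05, rng=None):
    """D1: rank tokens by mean-minus-std of attention under random noise,
    penalizing tokens whose scores are unstable. attn_fn(x) -> (N,) scores."""
    rng = np.random.default_rng(0) if rng is None else rng
    samples = np.stack([attn_fn(x + rng.normal(0.0, sigma, x.shape))
                        for _ in range(m_samples)])
    return samples.mean(axis=0) - beta * samples.std(axis=0)

def stochastic_topk(scores, k, delta, rng=None):
    """D2: take the top-(K+Delta) candidate pool, then uniformly sample
    K survivors, breaking the deterministic predictability of selection."""
    rng = np.random.default_rng(0) if rng is None else rng
    pool = np.argsort(scores)[::-1][:k + delta]
    return rng.choice(pool, size=k, replace=False)

def topk_mass(scores, k):
    """Detection cue: cumulative attention mass of the Top-K tokens;
    unusually low 'peakiness' can flag adversarially dispersed attention."""
    p = scores / scores.sum()
    return np.sort(p)[::-1][:k].sum()
```

With `stochastic_topk`, an attacker can no longer guarantee which of the top-$(K+\Delta)$ tokens survive, so concentrating distortion on an exact survivor set becomes a probabilistic bet rather than a deterministic outcome.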
© 2026 Promptfoo. All rights reserved.