Video-LLM Token Flood
Research Paper
VidDoS: Universal Denial-of-Service Attack on Video-based Large Language Models
Description: Video-based Large Language Models (Video-LLMs) are vulnerable to a universal Energy-Latency Attack (ELA) that triggers Denial-of-Service (DoS) via spatially concentrated adversarial patches. Because video architectures rely on temporal subsampling and pooling, which act as a low-pass filter against full-frame diffuse noise, an attacker can bypass this compression by anchoring cross-modal attention to a dense, localized visual anomaly. By injecting a fixed, content-agnostic patch into the peripheral regions of video frames, the attacker hijacks the autoregressive decoder. The patch is optimized with masked teacher forcing to steer the model toward computationally expensive "sponge" sequences, explicitly penalizing the emission of concise answers (e.g., "Yes", "No") and suppressing the End-of-Sequence (EOS) token probability. This traps the model in an unbounded generation regime regardless of the textual prompt.
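The optimization objective described above can be sketched as follows. This is a minimal, illustrative NumPy version, not the paper's implementation: `sponge_loss`, its weighting parameters, and the token IDs are assumptions. It combines teacher-forced cross-entropy on a long "sponge" target with direct penalties on the probability mass assigned to the EOS token and to short-answer tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sponge_loss(logits, target_ids, eos_id, banned_ids,
                lam_eos=1.0, lam_ban=1.0):
    """Illustrative sponge objective (minimized w.r.t. the patch pixels):
    - cross-entropy pushes the decoder toward a long teacher-forced target,
    - the EOS term pushes down the probability of stopping at each step,
    - the banned term pushes down concise answers (e.g. "Yes"/"No")."""
    probs = softmax(logits)                                  # (T, V)
    steps = np.arange(len(target_ids))
    ce = -np.log(probs[steps, target_ids] + 1e-12).mean()    # teacher forcing
    eos_pen = probs[:, eos_id].mean()                        # mean P(EOS)
    ban_pen = probs[:, banned_ids].sum(axis=-1).mean()       # mean P(short answer)
    return ce + lam_eos * eos_pen + lam_ban * ban_pen

# Toy usage: 8 decoding steps over a 32-token vocabulary.
rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 32))
loss = sponge_loss(logits, target_ids=np.arange(8), eos_id=2, banned_ids=[5, 6])
```

In a real attack this scalar would be backpropagated through the surrogate model to the patch pixels over a surrogate dataset; the sketch only shows the shape of the objective.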
Examples:
An attacker optimizes a static, universal patch (e.g., 48x48 pixels) offline using a surrogate dataset to suppress the EOS token and specific short-answer vocabulary. The attacker then continuously overlays this identical patch onto a fixed peripheral location (e.g., the bottom-right corner) of a dynamic video stream. When the modified video is fed into the Video-LLM alongside a standard binary-choice prompt (e.g., "Does this scenario require a manual takeover?"), the model ignores its typical conciseness prior. Instead of outputting a single token ("Yes" or "No"), the model hallucinates continuously up to the hard-coded maximum token limit (e.g., max_new_tokens=512), resulting in a 205x token expansion.
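The overlay step itself is trivial, which is part of what makes the attack practical: once optimized offline, the identical patch is pasted into the same peripheral location of every frame before the stream reaches the model. A minimal sketch, where the function name, frame layout, and margin are illustrative assumptions:

```python
import numpy as np

def overlay_patch(frames, patch, margin=8):
    """Paste a fixed adversarial patch into the bottom-right corner of
    every frame. frames: (T, H, W, C) array; patch: (h, w, C) array."""
    out = frames.copy()
    h, w = patch.shape[:2]
    H, W = frames.shape[1:3]
    y0, x0 = H - h - margin, W - w - margin  # fixed peripheral anchor
    out[:, y0:y0 + h, x0:x0 + w, :] = patch  # same patch on every frame
    return out

# Toy usage: a 16-frame 224x224 RGB clip with a 48x48 universal patch.
frames = np.zeros((16, 224, 224, 3), dtype=np.uint8)
patch = np.full((48, 48, 3), 255, dtype=np.uint8)
poisoned = overlay_patch(frames, patch)
```

Because the patch is static and content-agnostic, this can run per-frame on a live stream with negligible cost to the attacker.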
Impact: Successfully exploiting this vulnerability leads to extreme computational exhaustion and resource-level Denial-of-Service. By forcing maximum-length outputs, the attack inflates inference latency by over 15x. In real-time, safety-critical streaming pipelines (such as synchronous manual takeover assistance in autonomous driving), this induced latency cascades and blocks subsequent frame processing. This delays the AI's response time well beyond human safety limits (e.g., >2.72 seconds), causing critical safety violations and neutralizing the system's operational availability. Furthermore, the attack remains highly robust against stochastic decoding, maintaining expansion ratios above 240x even at elevated sampling temperatures (e.g., T=1.5).
Affected Systems:
- LLaVA-NeXT-Video-7B
- Qwen3-VL-4B-Instruct
- Video-LLaVA-7B-hf
- Any Video-LLM or streaming multimodal system that processes rigid 5D spatio-temporal tensors (Batch, Time, Channel, Height, Width) and is therefore susceptible to spatially anchored patch insertion.
© 2026 Promptfoo. All rights reserved.