LMVD-ID: b559f32f
Published January 1, 2026

Sound Induces Trimodal Collapse

Affected Models: Qwen 2 7B, Qwen 2.5

Research Paper

SoundBreak: A Systematic Study of Audio-Only Adversarial Attacks on Trimodal Models


Description: A vulnerability exists in trimodal audio-video-language models where an attacker can systematically degrade multimodal reasoning through untargeted, audio-only adversarial perturbations. By optimizing a shared perturbation $\delta$ applied to the audio channel, an attacker can manipulate internal representations—specifically targeting audio encoder embeddings and cross-modal attention mechanisms—without modifying visual or textual inputs. The attack exploits the model's reliance on the audio encoder (e.g., BEATs, Whisper-style encoders) and its projection into the language model space. The most effective attack vectors maximize the orthogonality between clean and perturbed audio embeddings ($\mathcal{L}^{(\cos)}$) or artificially amplify attention allocation to the audio tokens ($\mathcal{L}^{(\text{audioatt})}$). This results in a high-confidence divergence from the ground truth answer, effectively blinding the model to visual context despite the video input remaining clean.

Examples: The vulnerability is reproduced by optimizing an additive perturbation $\delta$ using Projected Gradient Descent (PGD) against the audio encoder's embedding space.

  1. Encoder-Space Attack Formulation: The attacker targets the audio encoder $f_a$ and projection layer $g_a$. The objective is to minimize the cosine similarity between the clean audio embeddings ($\mathbf{E}_{\text{clean}}$) and the perturbed audio embeddings ($\mathbf{E}_{\text{adv}}$): $$ \mathcal{L}^{(\cos)}(\delta) = \frac{1}{N_e} \sum_{i=1}^{N_e} \frac{\mathbf{E}_{\text{clean}}[i] \cdot \mathbf{E}_{\text{adv}}[i]}{\|\mathbf{E}_{\text{clean}}[i]\|_2 \, \|\mathbf{E}_{\text{adv}}[i]\|_2} $$ where $\mathbf{E}_{\text{adv}} = g_a(f_a(x_a + \delta))$. Optimization is performed using Adam (learning rate $10^{-4}$) under an $\ell_\infty$ constraint $\epsilon$.

  2. Audio Attention Amplification: Alternatively, the attacker maximizes the attention mass assigned to the audio tokens $\mathcal{T}_a$ to force the model to over-rely on the corrupted audio stream: $$ \mathcal{L}^{(\text{audioatt})}(\delta) = - \sum_{l=1}^{L} \sum_{h=1}^{H} \sum_{i=1}^{N} \sum_{j \in \mathcal{T}_a} \mathbf{A}^{(l,h)}_{i,j} $$ where $\mathbf{A}^{(l,h)}$ denotes the attention matrix at layer $l$ and head $h$.

  3. Reference Implementation: See mathematical formulations in Appendix C.3 and C.5 of the paper. See dataset AVQA for reproduction targets.

Impact:

  • Multimodal Reasoning Failure: The attack achieves up to a 96.03% Attack Success Rate (ASR) on benchmarks like AVQA, reducing model accuracy from ~95% to ~4%.
  • Model Hallucination: The model generates incorrect answers with high confidence (confidence scores comparable to clean inputs), rendering output uncertainty metrics ineffective for detection.
  • Stealthiness: The attack is effective at low perceptual distortions (LPIPS $\le 0.08$) and does not require manipulation of the video feed or text prompt, making it difficult to detect via visual inspection or text-based filters.

Affected Systems:

  • VideoLLaMA2 (specifically using the BEATs encoder and a Qwen2-7B-Instruct backbone).
  • Qwen 2.5 Omni (using Whisper-style audio encoder).
  • Qwen 3 Omni (using Whisper-style audio encoder).
  • Any trimodal foundation model employing independent audio encoders projected into a shared LLM embedding space without explicit robustness training.

Mitigation Steps:

  • Cross-Modal Consistency Checking: Implement inference-time defenses that compare predictions derived from unimodal inputs (video-only vs. audio-only) against the trimodal output to detect inconsistencies caused by single-modality attacks.
  • Adversarial Training: Integrate adversarially perturbed audio samples into the instruction-tuning dataset to improve the robustness of the audio projector and LLM backbone.
  • Input Preprocessing: Apply audio purification techniques or re-encoding steps prior to the audio encoder to disrupt the structure of the adversarial perturbation $\delta$.
  • Gradient Shattering (Experimental): While not explicitly validated in the paper for trimodal models, techniques that disrupt gradient flow in the audio encoder (similar to AudioGuard) may reduce the efficacy of white-box optimization.
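The cross-modal consistency check in the first mitigation can be sketched as a simple decision rule over per-modality predictions: if the trimodal answer abandons the (clean) video-only answer and tracks the audio-only answer, the audio channel is suspect. The exact-match comparison and the rule itself are illustrative assumptions, not the paper's defense.

```python
from dataclasses import dataclass

@dataclass
class ModalityAnswers:
    """Answers from the same model under input ablations."""
    trimodal: str     # audio + video + text
    video_only: str   # audio channel zeroed / dropped
    audio_only: str   # video channel zeroed / dropped

def suspect_audio_attack(a: ModalityAnswers) -> bool:
    # Clean samples usually yield consistent answers across ablations.
    # An audio-only attack leaves the video-only answer intact while
    # pulling the trimodal answer toward the corrupted audio stream.
    if a.trimodal == a.video_only:
        return False
    return a.trimodal == a.audio_only

clean = ModalityAnswers("a dog barking", "a dog barking", "a dog barking")
attacked = ModalityAnswers("a car engine", "a dog barking", "a car engine")
print(suspect_audio_attack(clean), suspect_audio_attack(attacked))
```

In practice the string comparison would be replaced by a semantic-similarity threshold (e.g. embedding distance between answers), since free-form answers rarely match exactly.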

© 2026 Promptfoo. All rights reserved.