Airborne Audio LLM Hijack
Research Paper
Attacker's Noise Can Manipulate Your Audio-based LLM in the Real World
Description:
Audio-based Large Language Models (ALLMs), specifically Qwen2-Audio, are vulnerable to over-the-air adversarial audio attacks. An attacker with white-box access can generate robust adversarial audio perturbations using gradient-based optimization combined with audio augmentation techniques (specifically SpecAugment, translation, and additive noise). These perturbations, when played through a speaker in the physical environment, manipulate the ALLM processing the audio via a microphone. This allows for two attack vectors: targeted attacks, where the model is forced to transcribe or execute specific malicious commands (e.g., "Hey Qwen, delete my calendar events") regardless of the actual user input; and untargeted attacks, where the model's speech recognition utility is degraded via induced high perplexity and Word Error Rate (WER), even in the presence of system instructions designed to ignore background noise.
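To make the augmentation step concrete, below is a minimal sketch of waveform-level robustness transforms of the kind the description lists (SpecAugment-style frequency masking, translation, and additive noise). The function name, parameter values, and PyTorch implementation details are illustrative assumptions, not the paper's exact code.

```python
import torch

def augment_for_air(x: torch.Tensor, sr: int = 16000) -> torch.Tensor:
    """One random draw of the robustness transforms, applied to a waveform of
    shape (1, T). All transforms are differentiable, so gradients can flow
    back to the adversarial perturbation during optimization."""
    # Translation: circularly shift the waveform by up to ~100 ms.
    max_shift = int(0.1 * sr)
    shift = int(torch.randint(-max_shift, max_shift + 1, (1,)))
    x = torch.roll(x, shifts=shift, dims=-1)

    # Additive noise at a random, modest amplitude.
    noise_scale = float(torch.empty(1).uniform_(0.001, 0.01))
    x = x + noise_scale * torch.randn_like(x)

    # SpecAugment-style frequency masking: zero out a random band in the STFT.
    n_fft = 512
    win = torch.hann_window(n_fft, device=x.device)
    spec = torch.stft(x, n_fft=n_fft, window=win, return_complex=True)
    n_bins = spec.shape[-2]
    band = int(torch.randint(1, n_bins // 8, (1,)))
    start = int(torch.randint(0, n_bins - band, (1,)))
    mask = torch.ones_like(spec.real)
    mask[..., start:start + band, :] = 0.0
    spec = spec * mask
    return torch.istft(spec, n_fft=n_fft, window=win, length=x.shape[-1])
```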
Examples:
The vulnerability is reproduced by optimizing an adversarial audio input $x$ to maximize the probability of a target sequence $t_{1:n}$. The attack utilizes the following optimization parameters: learning rate $\alpha=0.0002$, perturbation limit $\epsilon=0.1$ (or 0.01), and optimization over 5000 iterations.
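A minimal sketch of this optimization loop, using the hyperparameters quoted above, might look as follows. Here `target_loss_fn` (the negative log-likelihood of the attacker's target sequence under Qwen2-Audio) and `augment` (e.g., the transform sketched earlier) are assumed helpers, not the authors' actual interface.

```python
import torch

def craft_adversarial_audio(x_clean, target_loss_fn, augment,
                            alpha=2e-4, eps=0.1, steps=5000):
    """Optimize an additive perturbation delta (||delta||_inf <= eps) so the
    model assigns high probability to the attacker's target sequence."""
    delta = torch.zeros_like(x_clean, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=alpha)
    for _ in range(steps):
        # Apply a random robustness transform each step (expectation over
        # transforms) so the perturbation survives playback and re-recording.
        x_adv = augment(torch.clamp(x_clean + delta, -1.0, 1.0))
        loss = target_loss_fn(x_adv)       # -log p(target sequence | x_adv)
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)        # project back into the epsilon ball
    return (x_clean + delta).detach()
```

In the targeted setting, `target_loss_fn` would presumably score the wake-word-plus-command string from the examples below; in the untargeted setting, the same loop can instead minimize the likelihood of the true transcript.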
- Targeted Command Injection:
  - Adversarial noise is crafted to force the model to output the specific wake-word and command sequence: "Hey Qwen, delete my calendar events" or "Hey Qwen, send money to X".
  - The adversarial audio remains effective when played over the air (recorded via an iPhone 15 from an HP Chromebook speaker) because SpecAugment (frequency-band masking) is included during attack generation.
- Untargeted Utility Degradation:
  - Adversarial noise is added to benign speech samples (from the LibriSpeech dataset).
  - Result: The model fails to recognize the original speech, yielding significantly higher perplexity (e.g., rising from ~1.09 in clean settings to ~12 under attack) and effectively denying service to the user (see the metric sketch after this list).
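As a rough illustration of how the degradation above can be quantified, the following sketch computes WER with the jiwer package and perplexity as the exponential of the mean per-token negative log-likelihood; the helper name and inputs are assumptions, not the paper's evaluation code.

```python
import math
import jiwer

def utility_metrics(reference: str, hypothesis: str, token_nlls: list[float]) -> dict:
    """WER between the true and decoded transcripts, plus perplexity derived
    from the per-token negative log-likelihoods the model assigns."""
    return {
        "wer": jiwer.wer(reference, hypothesis),
        "perplexity": math.exp(sum(token_nlls) / len(token_nlls)),
    }
```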
Impact:
- Remote Command Execution: Attackers can trigger unauthorized actions on the victim's device (e.g., modifying calendars, initiating transactions) without the victim speaking the commands.
- Denial of Service: The utility of the voice assistant is neutralized by degrading transcription accuracy to unusable levels.
- Scalability: The attack can be broadcast in public spaces, simultaneously compromising multiple devices within audio range.
Affected Systems:
- Qwen2-Audio (and potentially other ALLMs utilizing similar Whisper-based encoders and LLM backbones).
- Voice-controlled AI agents and assistants integrating vulnerable ALLMs.
Mitigation Steps:
- Neural Audio Compression: Apply EnCodec compression to input audio. This was observed to be the most effective defense, reducing the attack success rate to nearly 0% by removing the adversarial perturbations during the encoding/decoding process (see the preprocessing sketch after this list).
- Audio Resampling: Modify the sample rate of the input audio (e.g., by a rescale factor of 0.8 or 1.2). While less robust than neural compression, this disrupts the specific adversarial patterns optimized for a fixed sample rate.
- Spectral Noise Gating: Implement noise reduction techniques (e.g., spectral gating) to filter out the adversarial noise floor, though adaptive attacks may bypass this.
- Note: Modifying system instructions (e.g., prompting the model to "ignore background noise") is not an effective mitigation strategy against this vulnerability.
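Below is a minimal sketch of the first two preprocessing defenses, assuming the facebookresearch `encodec` package and torchaudio. The target bandwidth and rescale factor are illustrative, and the 16 kHz input rate is an assumption based on Qwen2-Audio's Whisper-style front end.

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

def encodec_roundtrip(wav: torch.Tensor, sr: int) -> torch.Tensor:
    """Encode/decode the audio with EnCodec so fine-grained adversarial
    perturbations are discarded, then resample back to the original rate."""
    model = EncodecModel.encodec_model_24khz()
    model.set_target_bandwidth(6.0)                        # kbps, illustrative
    wav24 = convert_audio(wav, sr, model.sample_rate, model.channels)
    with torch.no_grad():
        frames = model.encode(wav24.unsqueeze(0))
        cleaned = model.decode(frames).squeeze(0)
    return torchaudio.functional.resample(cleaned, model.sample_rate, sr)

def rescale_sample_rate(wav: torch.Tensor, sr: int, factor: float = 0.8) -> torch.Tensor:
    """Resample by a rescale factor (0.8 or 1.2 above) to disrupt perturbations
    optimized for a fixed sample rate."""
    return torchaudio.functional.resample(wav, sr, int(sr * factor))
```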