LMVD-ID: 102c2256
Published May 1, 2025

Fine-Grained Speech LLM Control

Affected Models: Qwen2-Audio 7B

Research Paper

Universal Acoustic Adversarial Attacks for Flexible Control of Speech-LLMs


Description: Speech-LLMs Qwen2-Audio (7B-Instruct) and Granite-Speech (3.2-8b) are vulnerable to universal acoustic adversarial attacks. An attacker can optimize a fixed, input-agnostic audio segment (approximately 3.2 seconds in length) via gradient-based optimization on the model's frozen weights. When this adversarial segment is prepended to any arbitrary user speech input, it manipulates the model's latent representation, effectively overriding system prompts and generation behavior. This vulnerability manifests in three forms: (1) General Muting, where the model is forced to output an end-of-transcription token immediately, resulting in a denial of service; (2) Task Control, where the model ignores the original system instruction (e.g., "translate to French") and performs an attacker-defined task (e.g., "transcribe in English") regardless of the input content; and (3) Selective Attacks, where the adversarial segment conditionally suppresses output based on latent attributes such as speaker gender or language, functioning normally for non-targeted attributes.
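The core mechanism can be illustrated with a short optimization loop. The sketch below is a minimal, hypothetical PyTorch illustration rather than the authors' code: `first_token_logits` is an assumed stand-in for the frozen speech-LLM's forward pass, `EOT_TOKEN_ID` is a placeholder, and it only implements the general-muting objective using the 3.2-second segment length and $l_{\infty}$ bound of 0.02 mentioned in this entry.

```python
import torch
import torch.nn.functional as F

# Hypothetical wrapper around a frozen speech-LLM (e.g., Qwen2-Audio).
# Assumed to take a raw 16 kHz waveform plus a text system prompt and
# return the logits of the first generated token. The real model
# interface will differ; this is only a stand-in for the sketch.
def first_token_logits(waveform: torch.Tensor, system_prompt: str) -> torch.Tensor:
    raise NotImplementedError("replace with the target model's forward pass")

SAMPLE_RATE = 16_000
PREFIX_SECONDS = 3.2      # segment length reported in this entry
EPSILON = 0.02            # l_inf amplitude bound mentioned in this entry
EOT_TOKEN_ID = 0          # placeholder: id of the end-of-transcription token

# The adversarial segment a is the only trainable tensor; model weights stay frozen.
adv_prefix = torch.zeros(int(PREFIX_SECONDS * SAMPLE_RATE), requires_grad=True)
optimizer = torch.optim.Adam([adv_prefix], lr=1e-3)

def attack_step(batch_of_waveforms: list[torch.Tensor], system_prompt: str) -> float:
    """One optimization step of the 'general muting' objective:
    push the first generated token towards <eot> for every input x."""
    optimizer.zero_grad()
    loss = 0.0
    for x in batch_of_waveforms:
        # Prepend the shared adversarial segment to the benign speech input.
        audio = torch.cat([adv_prefix, x], dim=0)
        logits = first_token_logits(audio, system_prompt)
        loss = loss + F.cross_entropy(logits.unsqueeze(0),
                                      torch.tensor([EOT_TOKEN_ID]))
    loss = loss / len(batch_of_waveforms)
    loss.backward()
    optimizer.step()
    # Keep the segment within the amplitude budget after each step.
    with torch.no_grad():
        adv_prefix.clamp_(-EPSILON, EPSILON)
    return float(loss)
```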

Examples: The attack requires white-box access to the model to generate the adversarial segment $\mathbf{a}$, but the resulting segment is universal and transferable.

  1. General Muting Attack: The attacker optimizes a 3.2-second audio segment $\mathbf{a}$ to maximize the probability of the <eot> (end-of-transcription) token as the first generated token ($y_1$). A consolidated loss sketch covering all three attack variants follows this list.
  • Optimization Objective: $\mathbf{\hat{a}} = \arg\max_{\mathbf{a}} P(y_1 = \texttt{<eot>} \mid \text{Enc}(\mathbf{a} \oplus \mathbf{x}), \mathcal{P}_{\text{src}})$
  • Execution: The attacker prepends $\mathbf{\hat{a}}$ to any user audio input $\mathbf{x}$.
  • Result: The model returns an empty string (""), ignoring the actual speech content.

  2. Task Control (Prompt Override): The system uses a prompt $\mathcal{P}_{\text{src}}$ (e.g., "Translate to French"). The attacker generates $\mathbf{a}$ using reference targets $\mathbf{y}_{\text{tgt}}$ corresponding to a different task (e.g., English transcription).
  • Optimization Objective: $\mathbf{\hat{a}} = \arg\max_{\mathbf{a}} P(\mathbf{y}_{\text{tgt}} \mid \text{Enc}(\mathbf{a} \oplus \mathbf{x}), \mathcal{P}_{\text{src}})$
  • Result: Despite the system prompt instructing translation, the model outputs an English transcription of the user's speech.
  3. Selective Gender Suppression: The attacker optimizes $\mathbf{a}$ to output <eot> only when the speaker is female ($f(\mathbf{x})=1$) and to function normally otherwise.
  • Optimization Objective: Maximize $P(\texttt{<eot>})$ if $f(\mathbf{x})=1$; otherwise maximize $P(\mathbf{y}_{\text{true}})$.
  • Result: The model functions correctly for male speakers but outputs nothing for female speakers (92.2% suppression success rate on LibriSpeech dev_other).
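For concreteness, the three objectives above can be written as per-example losses to be minimized while optimizing $\mathbf{a}$ (the segment itself would be updated as in the loop sketched earlier). This is a hedged sketch, not the paper's implementation: `first_token_logits`, `teacher_forced_logits`, and `EOT_TOKEN_ID` are hypothetical stand-ins for the model interface.

```python
import torch
import torch.nn.functional as F

EOT_TOKEN_ID = 0  # placeholder: id of the end-of-transcription token

# Hypothetical model wrappers (stand-ins for the frozen speech-LLM's forward pass):
def first_token_logits(audio, prompt):
    """Return (vocab,) logits of the first generated token."""
    raise NotImplementedError

def teacher_forced_logits(audio, prompt, target_ids):
    """Return (len(target_ids), vocab) logits under teacher forcing."""
    raise NotImplementedError

def muting_loss(audio, prompt):
    """(1) General muting: make <eot> the most likely first token."""
    logits = first_token_logits(audio, prompt)
    return F.cross_entropy(logits.unsqueeze(0), torch.tensor([EOT_TOKEN_ID]))

def task_control_loss(audio, prompt, y_tgt):
    """(2) Task control: maximize the likelihood of an attacker-chosen target
    sequence y_tgt (e.g., an English transcription) even though the system
    prompt requests a different task (e.g., French translation)."""
    return F.cross_entropy(teacher_forced_logits(audio, prompt, y_tgt), y_tgt)

def selective_loss(audio, prompt, y_true, is_targeted: bool):
    """(3) Selective attack: mute inputs with the targeted attribute
    (f(x)=1, e.g., female speakers); otherwise reinforce normal behavior."""
    if is_targeted:
        return muting_loss(audio, prompt)
    return F.cross_entropy(teacher_forced_logits(audio, prompt, y_true), y_true)
```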

Impact:

  • Denial of Service (DoS): Attackers can mute the model for all users or specific demographics, rendering the service unusable.
  • Prompt Injection/Bypass: Attackers can override safety guidelines or system instructions (e.g., bypassing moderation filters or forcing the model to process prohibited content types).
  • Fairness and Bias Violation: The vulnerability allows attackers to induce discriminatory behavior in deployed systems, selectively silencing specific languages or genders and introducing severe ethical and reputational risks.
  • Data Integrity Loss: The model produces outputs corresponding to tasks not requested by the system owner, corrupting downstream applications relying on specific output formats (e.g., receiving English text instead of requested French translations).

Affected Systems:

  • Qwen2-Audio (specifically Qwen2-Audio-7B-Instruct)
  • Granite-Speech (specifically granite-speech-3.2-8b)
  • Other Speech-LLMs utilizing similar architectures (Encoder-Adapter-LLM) may be susceptible to identical gradient-based universal perturbation generation.

Mitigation Steps:

  • Adversarial Training: Incorporate adversarial audio segments into the training dataset. Train the model to map inputs containing these perturbations to the correct reference outputs rather than the adversarial target.
  • Amplitude Constraints: Enforce strict $l_{\infty}$ norm constraints on input audio. While the paper demonstrates attacks are possible with $\epsilon=0.02$, tighter amplitude clipping or filtering of high-frequency components not typical in human speech may reduce attack success rates.
  • Input Sanitization: Implement pre-processing layers to detect and strip fixed, repetitive, or high-entropy audio patterns prepended to inputs before they reach the speech encoder.
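The second and third mitigation bullets can be prototyped with simple signal-level checks. The sketch below is illustrative only and assumes a 16 kHz waveform represented as a 1-D tensor; `matches_known_prefix` can only detect adversarial segments that defenders have already captured, and hard amplitude clipping alone does not remove a low-amplitude perturbation.

```python
import torch

def clip_amplitude(waveform: torch.Tensor, limit: float = 1.0) -> torch.Tensor:
    """Hard amplitude clipping (a coarse l_inf constraint on the whole input).
    Narrows the space of admissible perturbations but cannot, by itself,
    strip a low-amplitude adversarial segment."""
    return waveform.clamp(-limit, limit)

def matches_known_prefix(waveform: torch.Tensor,
                         known_prefix: torch.Tensor,
                         threshold: float = 0.9) -> bool:
    """Flag inputs whose leading samples are highly correlated with a
    previously identified adversarial segment (normalized cross-correlation)."""
    n = known_prefix.numel()
    if waveform.numel() < n:
        return False
    head = waveform[:n]
    score = torch.dot(head, known_prefix) / (head.norm() * known_prefix.norm() + 1e-8)
    return float(score) > threshold
```

In practice a detector of this kind would need to be combined with adversarial training, since an attacker with white-box access can regenerate a fresh segment once an old one is blocked.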
