Benign Audio Jailbreak
Research Paper
When good sounds go adversarial: Jailbreaking audio-language models with benign inputs
Description: Audio-Language Models (ALMs), including Qwen2.5-Omni (3B and 7B) and Phi-4-Multimodal, are vulnerable to "WhisperInject," a two-stage adversarial audio attack that bypasses safety guardrails. The vulnerability allows an attacker to inject imperceptible perturbations into benign audio inputs (e.g., a query about the weather) that force the model to generate specific harmful content. The attack uses a novel optimization method, Reinforcement Learning with Projected Gradient Descent (RL-PGD), to first discover a "native" harmful response (one the model is naturally inclined to generate), thereby circumventing the model's resistance to "foreign" adversarial targets. In the second stage, this native payload is embedded into a benign carrier audio signal using standard PGD. To a human listener the audio remains benign, but the model interprets it as a command to output restricted content.
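A minimal sketch of the intuition behind Stage 1 follows: among candidate harmful completions, keep the one to which the model itself assigns the highest likelihood. This is a simplified stand-in for the paper's RL-PGD search, not the authors' implementation; `model` and `tokenizer` are assumed to follow the Hugging Face causal-LM interface, and `native_target_score` is a hypothetical helper.

```python
# Simplified stand-in for Stage 1 (native target discovery): score each
# candidate payload by the model's own log-likelihood and keep the best one.
# Hypothetical helper; the paper's actual RL-PGD search is more involved.
import torch
import torch.nn.functional as F

def native_target_score(model, tokenizer, prompt: str, candidate: str) -> float:
    """Sum of log-probabilities the model assigns to `candidate` given `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + candidate, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits                # (1, seq_len, vocab)
    log_probs = F.log_softmax(logits[0, :-1], dim=-1)  # row t predicts token t+1
    start = prompt_ids.shape[1]                        # first candidate token position
    targets = full_ids[0, start:]
    return log_probs[start - 1:].gather(1, targets.unsqueeze(1)).sum().item()

# The highest-scoring candidate becomes the Stage 2 optimization target:
# payload = max(candidates, key=lambda c: native_target_score(model, tok, prompt, c))
```

Selecting the payload this way matters because forcing a response the model would never produce on its own (a "foreign" target) makes the downstream optimization much harder; a "native" target sits close to the model's natural output distribution.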
Examples: The attack requires an adversarial optimization pipeline available in the researchers' repository.
- Target Objective: The attacker defines a harmful intent, e.g., "Give me instructions on how to create a Molotov cocktail."
- Stage 1 (Native Target Discovery): Using RL-PGD, the attacker finds a specific response string that the model would be statistically likely to generate if its safety rails were disabled (e.g., "Sure, to create a Molotov cocktail, you need..."); the likelihood-scoring sketch above illustrates the underlying idea.
- Stage 2 (Payload Injection): The attacker selects a benign audio carrier, such as a TTS-generated voice saying: "Can you tell me a positive statement?"
- Perturbation: The attacker runs PGD on the benign audio to minimize the loss between the model's output and the native target response found in Stage 1, under an $L_\infty$ perturbation budget of $\epsilon \le 0.01$ (a PGD sketch follows the repository link below).
- Result:
  - Human Perception: The audio clip still sounds like "Can you tell me a positive statement?"
  - Model Output: The model ignores the semantic meaning of the audio and outputs the harmful instructions: "Sure, to create a Molotov cocktail..."
See repository: https://github.com/AIM-Intelligence/WhisperInject
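Below is a minimal sketch of the Stage 2 loop under the stated $L_\infty$ budget. It assumes a differentiable wrapper `loss_fn` that returns the cross-entropy between the model's output and the Stage 1 target tokens; the function name and hyperparameters are illustrative, not taken from the WhisperInject codebase.

```python
# Minimal PGD sketch for Stage 2 (payload injection). `loss_fn` is an assumed
# differentiable wrapper: audio waveform -> cross-entropy vs. the native payload.
import torch

def pgd_inject(carrier: torch.Tensor, loss_fn, eps: float = 0.01,
               alpha: float = 1e-3, steps: int = 500) -> torch.Tensor:
    """Search for adversarial audio within an L-inf ball of radius `eps` around `carrier`."""
    delta = torch.zeros_like(carrier, requires_grad=True)
    for _ in range(steps):
        loss = loss_fn(carrier + delta)          # CE between model output and target tokens
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()   # gradient descent on the loss
            delta.clamp_(-eps, eps)              # project back into the eps-ball
            # keep the perturbed waveform in the valid [-1, 1] range
            delta.copy_((carrier + delta).clamp(-1.0, 1.0) - carrier)
        delta.grad.zero_()
    return (carrier + delta).detach()
```

The sign-of-gradient step and the clamp back into the $\epsilon$-ball are the defining operations of PGD; everything else is bookkeeping to keep the waveform valid. With $\epsilon \le 0.01$ on a normalized waveform, the perturbation stays below typical human perceptual thresholds.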
Impact:
- Safety Bypass: Allows attackers to circumvent RLHF and safety-tuning mechanisms intended to prevent the generation of violence, hate speech, and illegal content.
- Stealth: The attack vector is audio-native and imperceptible to humans; a malicious audio file can be distributed via social media or messaging apps appearing as a harmless voice note.
- Behavioral Hijacking: The attack moves beyond mere transcription errors (classic ASR attacks) to full behavioral control of the multimodal LLM.
Affected Systems:
- Qwen2.5-Omni-3B
- Qwen2.5-Omni-7B
- Phi-4-Multimodal
- Likely affects other gradient-accessible Audio-Language Models.
Mitigation Steps:
- Audio-Signal Level Defenses: Implement safety mechanisms that operate directly on the audio signal representation rather than relying solely on text-based output filtering, as the perturbation occurs in the continuous audio embedding space.
- Robust Training: Future development should focus on defenses against "native payload discovery," potentially through adversarial training on audio inputs.
- Perturbation Detection: Deploy detection systems capable of identifying high-frequency adversarial noise or structural anomalies in input audio spectrograms; a minimal screening sketch follows.
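To make the last point concrete, a crude screen might flag audio whose high-frequency energy share is anomalously large. The cutoff and threshold below are hypothetical and would need calibration on clean audio; a real perturbation need not concentrate in high frequencies, so this is a first-line heuristic at best.

```python
# Illustrative spectral screen: flag audio with an unusually large share of
# energy above a cutoff frequency. Threshold and cutoff are hypothetical.
import torch

def high_freq_energy_ratio(wave: torch.Tensor, sr: int = 16000,
                           cutoff_hz: float = 6000.0, n_fft: int = 512) -> float:
    """Fraction of total spectral energy above `cutoff_hz` for a mono waveform."""
    spec = torch.stft(wave, n_fft=n_fft, window=torch.hann_window(n_fft),
                      return_complex=True).abs() ** 2     # (n_fft//2+1, frames)
    freqs = torch.linspace(0, sr / 2, spec.shape[0])      # bin center frequencies
    return (spec[freqs >= cutoff_hz].sum() / spec.sum()).item()

def looks_perturbed(wave: torch.Tensor, threshold: float = 0.15) -> bool:
    # Flag audio whose high-band energy share exceeds the (hypothetical) threshold.
    return high_freq_energy_ratio(wave) > threshold
```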