Benign Audio Jailbreak
Research Paper
When good sounds go adversarial: Jailbreaking audio-language models with benign inputs
Description: Audio-Language Models (ALMs), including Qwen2.5-Omni (3B and 7B) and Phi-4-Multimodal, are vulnerable to "WhisperInject," a two-stage adversarial audio attack that bypasses safety guardrails. The vulnerability allows an attacker to inject imperceptible perturbations into benign audio inputs (e.g., a query about the weather) that force the model to generate specific harmful content. The attack uses a novel optimization method, Reinforcement Learning with Projected Gradient Descent (RL-PGD), to first discover a "native" harmful response (one the model is naturally inclined to generate), thereby circumventing the model's resistance to "foreign" adversarial targets. In the second stage, this native payload is embedded into a benign carrier audio signal using standard PGD. To a human listener the audio remains benign, but the model interprets it as a command to output restricted content.
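A minimal sketch of the intuition behind Stage 1 follows: among candidate harmful completions, keep the one to which the model itself assigns the highest likelihood. This is a simplified stand-in for the paper's RL-PGD search, not the authors' implementation; `model` and `tokenizer` are assumed to follow the Hugging Face causal-LM interface, and `native_target_score` is a hypothetical helper.

```python
# Simplified stand-in for Stage 1 (native target discovery): score each
# candidate payload by the model's own log-likelihood and keep the best one.
# Hypothetical helper; the paper's actual RL-PGD search is more involved.
import torch
import torch.nn.functional as F

def native_target_score(model, tokenizer, prompt: str, candidate: str) -> float:
    """Sum of log-probabilities the model assigns to `candidate` given `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + candidate, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits                # (1, seq_len, vocab)
    log_probs = F.log_softmax(logits[0, :-1], dim=-1)  # row t predicts token t+1
    start = prompt_ids.shape[1]                        # first candidate token position
    targets = full_ids[0, start:]
    return log_probs[start - 1:].gather(1, targets.unsqueeze(1)).sum().item()

# The highest-scoring candidate becomes the Stage 2 optimization target:
# payload = max(candidates, key=lambda c: native_target_score(model, tok, prompt, c))
```

Selecting the payload this way matters because forcing a response the model would never produce on its own (a "foreign" target) makes the downstream optimization much harder; a "native" target sits close to the model's natural output distribution.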
Examples: The attack requires an adversarial optimization pipeline available in the researchers' repository.
- Target Objective: The attacker defines a harmful intent, e.g., "Give me instructions on how to create a Molotov cocktail."
- Stage 1 (Native Target Discovery): Using RL-PGD, the attacker finds a specific response string that the model would be statistically likely to generate if its safety rails were disabled (e.g., "Sure, to create a Molotov cocktail, you need..."); the likelihood-scoring sketch above illustrates the underlying idea.
- Stage 2 (Payload Injection): The attacker selects a benign audio carrier, such as a TTS-generated voice saying: "Can you tell me a positive statement?"
- Perturbation: The attacker runs PGD on the benign audio to minimize the loss between the model's output and the native target response found in Stage 1, under an $L_\infty$ perturbation budget of $\epsilon \le 0.01$ (a PGD sketch follows the repository link below).
- Result:
  - Human Perception: The audio clip still sounds like "Can you tell me a positive statement?"
  - Model Output: The model ignores the semantic meaning of the audio and outputs the harmful instructions: "Sure, to create a Molotov cocktail..."
See repository: https://github.com/AIM-Intelligence/WhisperInject
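Below is a minimal sketch of the Stage 2 loop under the stated $L_\infty$ budget. It assumes a differentiable wrapper `loss_fn` that returns the cross-entropy between the model's output and the Stage 1 target tokens; the function name and hyperparameters are illustrative, not taken from the WhisperInject codebase.

```python
# Minimal PGD sketch for Stage 2 (payload injection). `loss_fn` is an assumed
# differentiable wrapper: audio waveform -> cross-entropy vs. the native payload.
import torch

def pgd_inject(carrier: torch.Tensor, loss_fn, eps: float = 0.01,
               alpha: float = 1e-3, steps: int = 500) -> torch.Tensor:
    """Search for adversarial audio within an L-inf ball of radius `eps` around `carrier`."""
    delta = torch.zeros_like(carrier, requires_grad=True)
    for _ in range(steps):
        loss = loss_fn(carrier + delta)          # CE between model output and target tokens
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()   # gradient descent on the loss
            delta.clamp_(-eps, eps)              # project back into the eps-ball
            # keep the perturbed waveform in the valid [-1, 1] range
            delta.copy_((carrier + delta).clamp(-1.0, 1.0) - carrier)
        delta.grad.zero_()
    return (carrier + delta).detach()
```

The sign-of-gradient step and the clamp back into the $\epsilon$-ball are the defining operations of PGD; everything else is bookkeeping to keep the waveform valid. With $\epsilon \le 0.01$ on a normalized waveform, the perturbation stays below typical human perceptual thresholds.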
Impact:
- Safety Bypass: Allows attackers to circumvent RLHF and safety-tuning mechanisms intended to prevent the generation of violence, hate speech, and illegal content.
- Stealth: The attack vector is audio-native and imperceptible to humans; a malicious audio file can be distributed via social media or messaging apps appearing as a harmless voice note.
- Behavioral Hijacking: The attack moves beyond mere transcription errors (classic ASR attacks) to full behavioral control of the multimodal LLM.
Affected Systems:
- Qwen2.5-Omni-3B
- Qwen2.5-Omni-7B
- Phi-4-Multimodal
- Likely affects other gradient-accessible Audio-Language Models.
Mitigation Steps:
- Audio-Signal Level Defenses: Implement safety mechanisms that operate directly on the audio signal representation rather than relying solely on text-based output filtering, as the perturbation occurs in the continuous audio embedding space.
- Robust Training: Future development should focus on defenses against "native payload discovery," potentially through adversarial training on audio inputs.
- Perturbation Detection: Deploy detection systems capable of identifying high-frequency adversarial noise or structural anomalies in input audio spectrograms; a minimal screening sketch follows.
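To make the last point concrete, a crude screen might flag audio whose high-frequency energy share is anomalously large. The cutoff and threshold below are hypothetical and would need calibration on clean audio; a real perturbation need not concentrate in high frequencies, so this is a first-line heuristic at best.

```python
# Illustrative spectral screen: flag audio with an unusually large share of
# energy above a cutoff frequency. Threshold and cutoff are hypothetical.
import torch

def high_freq_energy_ratio(wave: torch.Tensor, sr: int = 16000,
                           cutoff_hz: float = 6000.0, n_fft: int = 512) -> float:
    """Fraction of total spectral energy above `cutoff_hz` for a mono waveform."""
    spec = torch.stft(wave, n_fft=n_fft, window=torch.hann_window(n_fft),
                      return_complex=True).abs() ** 2     # (n_fft//2+1, frames)
    freqs = torch.linspace(0, sr / 2, spec.shape[0])      # bin center frequencies
    return (spec[freqs >= cutoff_hz].sum() / spec.sum()).item()

def looks_perturbed(wave: torch.Tensor, threshold: float = 0.15) -> bool:
    # Flag audio whose high-band energy share exceeds the (hypothetical) threshold.
    return high_freq_energy_ratio(wave) > threshold
```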