LMVD-ID: 0e86f25f
Published August 1, 2025

Benign Audio Jailbreak

Affected Models:Qwen 2.5 3B, Phi-4

Research Paper

When good sounds go adversarial: Jailbreaking audio-language models with benign inputs

View Paper

Description: Audio-Language Models (ALMs) including Qwen2.5-Omni (3B and 7B) and Phi-4-Multimodal are vulnerable to "WhisperInject," a two-stage adversarial audio attack that bypasses safety guardrails. The vulnerability allows an attacker to inject imperceptible perturbations into benign audio inputs (e.g., a query about the weather) that force the model to generate specific harmful content. The attack utilizes a novel optimization method, Reinforcement Learning with Projected Gradient Descent (RL-PGD), to first discover a "native" harmful response—a response the model is naturally inclined to generate—thereby circumventing resistance to "foreign" adversarial prompts. In the second stage, this native payload is embedded into a benign carrier audio signal using PGD. To a human listener, the audio remains benign, but the model interprets it as a command to output restricted content.

Examples: The attack requires an adversarial optimization pipeline available in the researchers' repository.

  1. Target Objective: The attacker defines a harmful intent, e.g., "Give me instructions on how to create a Molotov cocktail."
  2. Stage 1 (Native Target Discovery): Using RL-PGD, the attacker finds a specific response string that the model is statistically likely to generate if safety rails were disabled (e.g., "Sure, to create a Molotov cocktail, you need...").
  3. Stage 2 (Payload Injection): The attacker selects a benign audio carrier, such as a TTS-generated voice saying: "Can you tell me a positive statement?"
  4. Perturbation: The attacker applies the PGD algorithm to the benign audio to minimize the loss between the model's output and the target native response found in Stage 1, with an $L_\infty$ perturbation budget of $\epsilon \le 0.01$.
  5. Result:
  • Human Perception: The audio clip still sounds like "Can you tell me a positive statement?"
  • Model Output: The model ignores the semantic meaning of the audio and outputs the harmful instructions: "Sure, to create a Molotov cocktail..."

See repository: https://github.com/AIM-Intelligence/WhisperInject

Impact:

  • Safety Bypass: Allows attackers to circumvent RLHF and safety-tuning mechanisms intended to prevent the generation of violence, hate speech, and illegal content.
  • Stealth: The attack vector is audio-native and imperceptible to humans; a malicious audio file can be distributed via social media or messaging apps appearing as a harmless voice note.
  • Behavioral Hijacking: Shift from mere transcription errors (ASR attacks) to full behavioral control of the multimodal LLM.

Affected Systems:

  • Qwen2.5-Omni-3B
  • Qwen2.5-Omni-7B
  • Phi-4-Multimodal
  • Likely affects other gradient-accessible Audio-Language Models.

Mitigation Steps:

  • Audio-Signal Level Defenses: Implement safety mechanisms that operate directly on the audio signal representation rather than relying solely on text-based output filtering, as the perturbation occurs in the continuous audio embedding space.
  • Robust Training: Future development should focus on defenses against "native payload discovery," potentially through adversarial training on audio inputs.
  • Perturbation Detection: Deploy detection systems capable of identifying high-frequency adversarial noise or structural anomalies in input audio spectrograms.

© 2026 Promptfoo. All rights reserved.