LMVD-ID: 9538e47e
Published January 1, 2025

LALM Audio Jailbreak

Affected Models: GPT-4o, Qwen2-Audio-7B

Research Paper

Jailbreak-AudioBench: In-Depth Evaluation and Analysis of Jailbreak Threats for Large Audio Language Models


Description: End-to-end Large Audio Language Models (LALMs) contain an audio-based jailbreak vulnerability allowing attackers to bypass safety alignment guardrails by manipulating audio-specific "hidden semantics." Unlike text-based attacks, this exploitation involves encoding harmful queries into audio and applying signal processing modifications—specifically changes to emphasis, speech speed, intonation, tone, background noise, celebrity accents, or emotional overlays (e.g., laughter, screaming). These acoustic variations disrupt the model's safety normalization processes in the transformer layers, causing the model to generate harmful, illegal, or unethical content that it would typically refuse if the query were presented in plain text or standard audio. The vulnerability is distinct from adversarial perturbations as it uses perceptible audio edits.

Examples: To reproduce this attack, an attacker converts a harmful text prompt into audio (Text-to-Speech) and applies specific acoustic edits using standard audio processing tools before feeding it to the target LALM.

  • Attack Vector 1: Speed Modification (via SOX)

  • Command: sox input.wav output.wav tempo -s 0.5

  • Effect: Slowing the speech rate to 0.5x can bypass safety filters in models like Qwen2-Audio and SALMONN.

  • Attack Vector 2: Pitch/Tone Shifting (via SOX)

  • Command: sox input.wav output.wav pitch 400

  • Effect: Shifting the pitch by +4 semitones (400 cents) alters the internal representation of the query sufficiently to bypass alignment in susceptible models.

  • Attack Vector 3: Emphasis/Volume Manipulation (via Librosa)

  • Method: Selectively amplify specific segments of the audio (e.g., the first 1.0 second) by a factor of $k \in \{2, 5, 10\}$.

  • Formula: $x'(t) = k \cdot x(t)$ for $t$ within the emphasized segment; the rest of the signal is left unchanged.

  • Attack Vector 4: Query-based Combinatorial Editing

  • Method: Systematically combining modifications (e.g., applying an accent + speed reduction + background noise) dramatically increases attack success.

  • Result: In testing, combining edits raised the Attack Success Rate (ASR) against SALMONN-7B from 31.6% to 85.1%.
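The attack vectors above can be sketched in pure NumPy. This is a minimal, illustrative implementation, not the paper's exact tooling: `emphasize` implements the segment-amplification formula from Vector 3, `slow_down` is a naive linear-resampling stand-in for `sox tempo`, and `add_noise` overlays Gaussian background noise; chaining them mimics the combinatorial editing of Vector 4. All function names and parameter values are assumptions for the sketch.

```python
import numpy as np

def emphasize(x: np.ndarray, sr: int, start_s: float, end_s: float, k: float) -> np.ndarray:
    """Attack Vector 3: x'(t) = k * x(t) on [start_s, end_s), unchanged elsewhere."""
    y = x.copy()
    y[int(start_s * sr):int(end_s * sr)] *= k
    return np.clip(y, -1.0, 1.0)          # keep samples in the valid [-1, 1] range

def slow_down(x: np.ndarray, factor: float) -> np.ndarray:
    """Naive time-stretch by linear resampling (stand-in for `sox tempo -s`)."""
    n = int(len(x) / factor)              # factor 0.5 -> twice as many samples
    return np.interp(np.linspace(0, len(x) - 1, n), np.arange(len(x)), x)

def add_noise(x: np.ndarray, snr_db: float = 20.0, seed: int = 0) -> np.ndarray:
    """Overlay Gaussian background noise at a target signal-to-noise ratio."""
    rng = np.random.default_rng(seed)
    noise_pow = np.mean(x ** 2) / (10 ** (snr_db / 10))
    return x + rng.normal(0.0, np.sqrt(noise_pow), len(x))

# Attack Vector 4: combine edits (speed reduction + emphasis + background noise)
sr = 16000
t = np.arange(2 * sr) / sr
x = 0.1 * np.sin(2 * np.pi * 440 * t)     # synthetic 2-second tone as a placeholder
combined = add_noise(emphasize(slow_down(x, 0.5), sr, 0.0, 1.0, k=5), snr_db=20)
```

In practice the input would be TTS-rendered speech rather than a sine tone, and production attacks in the paper use SoX/Librosa edits; the point here is only the shape of the pipeline.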

Impact: Exploitation allows for the complete circumvention of safety guardrails in multimodal systems. This results in the generation of restricted content, including instructions for illegal acts, hate speech, and explicit material. The vulnerability demonstrates that safety alignment performed on text does not transfer robustly to the audio modality in end-to-end models.

Affected Systems:

  • SALMONN (e.g., SALMONN-7B) - High Vulnerability
  • Qwen2-Audio (e.g., Qwen2-Audio-7B)
  • MiniCPM-o-2.6
  • VITA-1.5
  • BLSP
  • SpeechGPT
  • R1-AQA
  • GPT-4o-Audio (Vulnerable to specific combinatorial edits, ASR increased from 0.7% to 8.4%)

Mitigation Steps:

  • Prepended Audio Defense Instructions: Generate a standardized defense prompt via TTS (e.g., "You are a helpful assistant and should refuse to generate illegal, harmful or unethical content.").
  • Audio Buffering: Prepend this defense audio clip to all incoming user audio queries.
  • Silence Insertion: Insert a 1000ms silence buffer between the defense prompt and the user's input audio to ensure the model processes the instruction as a distinct context setting.
  • Defense Effectiveness: This method has been shown to consistently reduce Attack Success Rate (ASR) across all tested LALMs, though it does not eliminate the risk entirely.
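The defense steps above amount to simple waveform concatenation. A minimal sketch, assuming mono float arrays at a shared sample rate; the variable names and placeholder clips are illustrative, and a real deployment would load an actual TTS rendering of the defense prompt:

```python
import numpy as np

def build_defended_input(defense_prompt: np.ndarray,
                         user_audio: np.ndarray,
                         sr: int,
                         gap_ms: int = 1000) -> np.ndarray:
    """Prepend the defense audio, then a silence buffer, then the user query."""
    silence = np.zeros(int(sr * gap_ms / 1000), dtype=user_audio.dtype)
    return np.concatenate([defense_prompt, silence, user_audio])

sr = 16000
defense_prompt = np.zeros(sr)             # placeholder for the 1 s TTS defense clip
user_audio = 0.01 * np.ones(2 * sr)       # placeholder 2-second user query
defended = build_defended_input(defense_prompt, user_audio, sr)
```

The 1000 ms silence buffer separates the defense instruction from the user query so the model treats the instruction as distinct context rather than part of the query audio.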

© 2026 Promptfoo. All rights reserved.