Semantic Audio Jailbreak
Research Paper
Audio Jailbreak: An Open Comprehensive Benchmark for Jailbreaking Large Audio-Language Models
Description: Large Audio-Language Models (LAMs) are vulnerable to adversarial signal-level perturbations that bypass safety guardrails (jailbreaking). Even models with robust text-based safety alignment fail to generalize that robustness to the audio modality. Attackers can use the Audio Perturbation Toolkit (APT) to apply transformations in the time domain (Energy Distribution Perturbation, Trimming, Fade In/Out), frequency domain (Pitch Shifting, Temporal Scaling), and mixing domain (Extra-auditory Priming, Natural Noise Injection). These perturbations are optimized via Bayesian Optimization to minimize the model's refusal score while preserving semantic consistency for human listeners (validated via GPTScore and Whisper transcription). When processed, the perturbed audio induces representation shifts that circumvent refusal mechanisms, coercing the model into generating harmful, unethical, or policy-violating content.
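The semantic-consistency constraint above can be approximated with a transcription check: if Whisper produces nearly the same transcript for the clean and perturbed audio, the perturbation has preserved the spoken content. The sketch below assumes the `openai-whisper` package and an illustrative 0.9 similarity threshold; neither the function names nor the threshold come from the paper.

```python
# Minimal sketch of a Whisper-based semantic-consistency check.
# Requires: pip install openai-whisper
import difflib

import whisper

_model = whisper.load_model("base")  # any Whisper checkpoint works here

def transcript_similarity(clean_path: str, perturbed_path: str) -> float:
    """Return a 0-1 similarity between Whisper transcripts of two files."""
    t_clean = _model.transcribe(clean_path)["text"].lower().strip()
    t_pert = _model.transcribe(perturbed_path)["text"].lower().strip()
    return difflib.SequenceMatcher(None, t_clean, t_pert).ratio()

def semantically_consistent(clean_path: str, perturbed_path: str,
                            threshold: float = 0.9) -> bool:
    """Accept a perturbation only if the transcript barely changes.
    The 0.9 threshold is an assumption, not a value from the paper."""
    return transcript_similarity(clean_path, perturbed_path) >= threshold
```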
Examples: The vulnerability is reproduced by generating adversarial audio samples using the AJailBench-APT+ methodology. Attacks are constructed by applying the following mathematical transformations to a policy-violating text prompt converted to speech (a runnable sketch follows the list):
- Temporal Scaling (Eq. 7): Compress or stretch audio using a Phase Vocoder (PV) without altering pitch.
- Formula: $x' = \text{iSTFT}(\text{PV}(\text{STFT}(x), \theta_{TS}))$
- Effective range: $\theta_{TS} < 0.6$ or $\theta_{TS} > 1.2$ significantly degrades safety.
- Fade In/Out (Eq. 4): Apply linear gain ramps to the audio onset/offset.
- Formula: $x' = \mathcal{T}_{\text{Fade}}(x; \gamma)$ where $\gamma \sim U(0, \theta_{\text{Fade}}]$.
- Extra-auditory Priming (Eq. 8): Inject sinusoidal signals in the infrasound ($f_a < 20$ Hz) or ultrasound ($f_a > 20$ kHz) range.
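The following is a minimal sketch of the three transforms above using NumPy and librosa (whose `time_stretch` is a phase-vocoder implementation). Parameter names mirror the equations, but the defaults and the file name `prompt.wav` are illustrative assumptions, not values from the AJailBench-APT+ code.

```python
# Hedged sketch of the APT transforms in Eqs. 4, 7, and 8.
import numpy as np
import librosa

def temporal_scaling(y: np.ndarray, theta_ts: float) -> np.ndarray:
    """Eq. 7: phase-vocoder time stretch; pitch is preserved."""
    return librosa.effects.time_stretch(y, rate=theta_ts)

def fade_in_out(y: np.ndarray, sr: int, theta_fade: float) -> np.ndarray:
    """Eq. 4: linear gain ramps over the onset/offset.
    gamma (ramp length in seconds) is drawn from U(0, theta_fade]."""
    gamma = np.random.uniform(0.0, theta_fade)
    n = max(1, min(int(gamma * sr), len(y) // 2))
    out = y.copy()
    out[:n] *= np.linspace(0.0, 1.0, n)
    out[-n:] *= np.linspace(1.0, 0.0, n)
    return out

def extra_auditory_priming(y: np.ndarray, sr: int, f_a: float,
                           amplitude: float = 0.05) -> np.ndarray:
    """Eq. 8: add a sinusoid outside the audible band (f_a < 20 Hz or
    f_a > 20 kHz; ultrasound requires sr > 2 * f_a to be representable)."""
    t = np.arange(len(y)) / sr
    return y + amplitude * np.sin(2 * np.pi * f_a * t)

# Example: apply all three to a spoken prompt
y, sr = librosa.load("prompt.wav", sr=None)
y = temporal_scaling(y, theta_ts=0.5)        # inside the degrading range
y = fade_in_out(y, sr, theta_fade=0.5)
y = extra_auditory_priming(y, sr, f_a=15.0)  # infrasound priming
```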
For pre-computed adversarial audio examples capable of bypassing current safety filters, refer to the AJailBench-APT+ dataset located in the official repository:
- Repository: https://github.com/mbzuai-nlp/AudioJailbreak
Impact:
- Safety Bypass: Circumvention of alignment protocols intended to prevent the generation of hate speech, disinformation, economic harm instructions, and other prohibited content.
- Cross-Modal Exploitation: Attacks succeed against models with strong text-modality defenses (e.g., GPT-4o) by exploiting the weaker robustness of the audio encoder and cross-modal projection.
- Automated Abuse: Bayesian Optimization enables fully automated generation of highly effective adversarial samples, with no human in the loop.
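As a hedged illustration of that automated loop, the sketch below uses scikit-optimize's Gaussian-process minimizer to search the perturbation parameters. `apply_apt` and `refusal_score` are hypothetical stand-ins for the APT pipeline and the target-model query (not APIs from the paper's codebase); `semantically_consistent` is the check sketched earlier, and the search bounds are illustrative.

```python
# Minimal sketch of the Bayesian-Optimization attack loop.
# Requires: pip install scikit-optimize
from skopt import gp_minimize
from skopt.space import Real

def objective(params):
    theta_ts, theta_fade, f_a = params
    # apply_apt is assumed to write the perturbed audio and return its path
    perturbed = apply_apt("prompt.wav", theta_ts, theta_fade, f_a)
    score = refusal_score(perturbed)  # lower = model refuses less
    if not semantically_consistent("prompt.wav", perturbed):
        score += 10.0                 # penalize drift in spoken meaning
    return score

search_space = [
    Real(0.4, 1.6, name="theta_ts"),    # temporal scaling factor
    Real(0.0, 1.0, name="theta_fade"),  # max fade ramp (seconds)
    Real(5.0, 19.0, name="f_a"),        # infrasound frequency (Hz)
]

# Gaussian-process BO: ~30 model queries, no human in the loop
result = gp_minimize(objective, search_space, n_calls=30, random_state=0)
print("best parameters:", result.x, "refusal score:", result.fun)
```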
Affected Systems: The following Large Audio-Language Models were tested and found vulnerable to varying degrees (ranked by vulnerability to APT+ attacks):
- SpeechGPT (Zhang et al., 2023)
- Qwen2-Audio (Chu et al., 2024)
- LLaMA-Omni (Fang et al., 2024)
- DiVA (Held et al., 2024)
- GPT-4o-audio (OpenAI / Achiam et al., 2023)
- Gemini-2.0-flash (Google / Reid et al., 2024)
- SALMONN (Tang et al., 2023)
Mitigation Steps:
- Adversarial Fine-Tuning: Incorporate semantically preserved perturbed audio samples (generated via APT) into the training alignment stage to improve robustness against signal variations.
- Consistency Regularization: Enforce consistency loss across augmented audio views during training to ensure the model produces similar internal representations for clean and perturbed versions of the same input.
- Front-end Signal Filtering: Deploy pre-processing filters to detect and sanitize input anomalies, specifically removing infrasound/ultrasound components and normalizing temporal/energy distributions before the audio reaches the LAM encoder (see the filter sketch after this list).
- Acoustic-Context-Aware Refusal: Implement decoding strategies that account for uncertainty and acoustic anomalies, triggering refusal states when high-variance signal perturbations are detected.
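A minimal sketch of the front-end filtering step, assuming SciPy: band-pass the signal to the audible range to strip infrasound/ultrasound priming, then peak-normalize the energy. The filter order and cutoffs are illustrative choices, not values from the paper.

```python
# Sketch of a sanitization front end placed before the LAM encoder.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def sanitize_audio(y: np.ndarray, sr: int) -> np.ndarray:
    """Keep only the audible band (~20 Hz to 20 kHz) and normalize the peak."""
    high = min(20_000.0, 0.45 * sr)  # keep the upper cutoff below Nyquist
    sos = butter(8, [20.0, high], btype="bandpass", output="sos", fs=sr)
    filtered = sosfiltfilt(sos, y)   # zero-phase filtering, no time shift
    peak = np.max(np.abs(filtered))
    return filtered / peak if peak > 0 else filtered
```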