LMVD-ID: 537439a4
Published March 1, 2026

Joint Audio-Text Jailbreak

Affected Models: Qwen2 Audio 7B, Qwen2.5 Omni 7B, Audio Flamingo 3, Gemma 3N

Research Paper

On Optimizing Multimodal Jailbreaks for Spoken Language Models

View Paper

Description: Spoken Language Models (SLMs) are vulnerable to Joint Audio-text Multimodal Attacks (JAMA), which bypass safety alignments by simultaneously perturbing both text and audio inputs. The vulnerability exploits the combined optimization of a discrete text suffix via Greedy Coordinate Gradient (GCG) and a continuous audio perturbation via Projected Gradient Descent (PGD). This joint gradient-based attack pushes the model's hidden layer representations into a distinct subspace far from the benign decision boundary, increasing jailbreak success rates by up to 10x compared to unimodal attacks. A computationally cheaper sequential approximation (SAMA) achieves comparable bypass rates by optimizing the text suffix first, followed by the audio perturbation.
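As a rough illustration of the continuous half of the attack, the PGD update takes a signed gradient step and projects the result back into an L-infinity ball around the clean audio. The sketch below is a minimal sketch only: a toy quadratic loss stands in for the SLM's target log-likelihood, and `eps`, `alpha`, the step count, and the one-second 16 kHz buffer are illustrative choices, not settings from the paper.

```python
import numpy as np

def pgd_audio_perturbation(audio, grad_fn, eps=0.01, alpha=0.002, steps=20):
    """L-infinity Projected Gradient Descent on a raw waveform.

    In the real attack, grad_fn(x) would come from backpropagating the
    target-sequence loss through the SLM's differentiable audio feature
    extractor; here it is a toy stand-in.
    """
    x = audio.copy()
    for _ in range(steps):
        x = x - alpha * np.sign(grad_fn(x))       # signed gradient step
        x = np.clip(x, audio - eps, audio + eps)  # project into the eps-ball
        x = np.clip(x, -1.0, 1.0)                 # stay a valid waveform
    return x

# Toy attack loss: squared distance to a fixed "target" vector.
target = np.full(16000, 0.05)
attack_loss = lambda x: 0.5 * np.sum((x - target) ** 2)
attack_grad = lambda x: x - target

clean = np.zeros(16000)
adv = pgd_audio_perturbation(clean, attack_grad)
assert np.max(np.abs(adv - clean)) <= 0.01 + 1e-9  # imperceptibility budget holds
assert attack_loss(adv) < attack_loss(clean)       # attack loss reduced
```

The projection step is what keeps the perturbation "imperceptible": no matter how many iterations run, the result never leaves the eps-ball around the original audio.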

Examples: The attack requires submitting a composite prompt consisting of an optimized text sequence and a perturbed audio file.

  • Text Component: [Malicious Target Query] + [GCG-Optimized Suffix Tokens (e.g., 4-16 tokens)]
  • Audio Component: [Base Audio (e.g., music or speech)] + [PGD imperceptible perturbation (e.g., 4-8 seconds)]

See the repository for full reproduction code and datasets: https://repos.lsv.uni-saarland.de/akrishnan/multimodal-jailbreak-slm
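The discrete text component can be sketched in the same spirit. Real GCG ranks candidate token swaps using gradients with respect to one-hot token embeddings before evaluating them; with a toy vocabulary the greedy coordinate step can simply be brute-forced. `VOCAB`, `HIDDEN_BEST`, and `suffix_loss` below are invented stand-ins for the model's vocabulary and target-sequence loss, not artifacts from the paper.

```python
# Toy surrogate: the loss counts positions where the suffix differs from a
# hidden optimum, standing in for the SLM's negative target log-likelihood.
VOCAB = ["!", "describing", "sure", "Here", "tutorial", "ignore"]
HIDDEN_BEST = ["sure", "Here", "tutorial", "!"]

def suffix_loss(suffix):
    return sum(a != b for a, b in zip(suffix, HIDDEN_BEST))

def greedy_coordinate_search(length=4, rounds=2):
    """Coordinate-wise greedy search over discrete suffix tokens.

    Real GCG shortlists candidate swaps with embedding gradients; with this
    tiny vocabulary every possible swap is evaluated directly.
    """
    suffix = [VOCAB[0]] * length
    for _ in range(rounds):
        for i in range(length):
            # greedily keep whichever token minimizes the loss at position i
            suffix[i] = min(VOCAB,
                            key=lambda t: suffix_loss(suffix[:i] + [t] + suffix[i + 1:]))
    return suffix

adv_suffix = greedy_coordinate_search()
assert suffix_loss(adv_suffix) == 0  # converges on this separable toy loss
```

In the joint (JAMA) variant, updates like this alternate with PGD audio steps against a shared loss; the sequential (SAMA) variant runs the text search to completion first, then optimizes the audio.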

Impact: An attacker with white-box access (or transferability) can reliably bypass safety filters to elicit harmful, restricted, or dangerous content (such as malware generation or physical harm instructions) from otherwise aligned multimodal models.

Affected Systems: Safety-aligned Spoken Language Models (SLMs) that process combined text and audio modalities, particularly those supporting differentiable audio feature extraction. Confirmed vulnerable systems include:

  • Qwen2.5 Omni (7B)
  • Qwen2 Audio (7B, Instruct)
  • Audio Flamingo 3
  • Gemma 3N (E2B, IT)

Mitigation Steps:

  • Implement Multimodal Guardrails: Evaluate and enforce safety alignments in the composite attack space rather than relying on unimodal (text-only or audio-only) robustness, which is insufficient.
  • Gradient Shattering: Utilize non-differentiable audio feature extractors (e.g., standard numpy-based processing without PyTorch backpropagation support) to restrict gradient flow into the input audio, effectively neutralizing the PGD component of the attack.
  • Input Sanitization: Apply perceptual hashing or audio preprocessing to disrupt imperceptible PGD perturbations before the signal reaches the speech encoder.
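The input-sanitization bullet can be made concrete with a minimal numpy-only sketch: bit-depth reduction as a preprocessing step, assuming waveforms normalized to [-1, 1]. The function name and the 8-bit depth are illustrative assumptions, not values from the paper, and real deployments would pair this with the other mitigations above.

```python
import numpy as np

def sanitize_audio(x, bits=8):
    """Coarsely re-quantize a [-1, 1] waveform before the speech encoder.

    An L-infinity perturbation smaller than half a quantization step
    (here ~2 / 255 / 2 ~= 0.004) is rounded away entirely.
    """
    levels = 2 ** bits - 1
    return np.round((x + 1.0) / 2.0 * levels) / levels * 2.0 - 1.0

# Snap a clean signal to the quantization grid, then add a sub-step
# PGD-style perturbation; sanitization restores the clean signal exactly.
clean = sanitize_audio(np.linspace(-0.5, 0.5, 1000))
perturbed = clean + 0.001 * np.sign(np.sin(np.arange(1000)))
assert np.allclose(sanitize_audio(perturbed), clean)
```

The trade-off is the usual one for preprocessing defenses: aggressive quantization degrades transcription quality, while gentle quantization leaves room for larger (still imperceptible) perturbations, so the bit depth must be tuned against both.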

© 2026 Promptfoo. All rights reserved.