Vulnerabilities in multimodal AI systems
A vulnerability in SpeechGPT allows safety filters to be bypassed with adversarial audio prompts crafted by a token-level attack. The attack is white-box only with respect to SpeechGPT's speech tokenization: the attacker uses knowledge of the tokenizer to generate adversarial discrete token sequences, which are then synthesized into audio. These audio prompts elicit restricted or harmful outputs that the model would normally suppress. The attack's effectiveness relies on the model's discrete audio token representation and does not require access to model parameters or gradients.
GhostPrompt demonstrates a vulnerability in the multimodal safety filters used with text-to-image generative models. A dynamic prompt optimization framework iteratively generates adversarial prompts that evade both text-level and image-level safety checks while preserving the prompt's original harmful intent. The bypass combines semantically aligned prompt rewriting with the injection of benign visual cues that confuse image-level filters.
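The core of such a filter-evasion loop can be shown as a short sketch. The `rewrite_prompt` and `passes_filters` callables below are hypothetical stand-ins for an LLM-based rewriter and the target pipeline's text and image safety checks; they are not GhostPrompt's actual components, and the benign-visual-cue injection step is omitted.

```python
# Minimal sketch of an iterative filter-evasion loop of the kind described
# above. `rewrite_prompt` and `passes_filters` are hypothetical stand-ins,
# not GhostPrompt's real components.
from typing import Callable, Optional, Tuple

def optimize_prompt(
    prompt: str,
    rewrite_prompt: Callable[[str, str], str],          # (current prompt, filter feedback) -> rewrite
    passes_filters: Callable[[str], Tuple[bool, str]],  # prompt -> (allowed?, feedback)
    max_iters: int = 20,
) -> Optional[str]:
    """Iteratively rewrite a prompt until it slips past the safety checks."""
    current = prompt
    for _ in range(max_iters):
        allowed, feedback = passes_filters(current)
        if allowed:
            return current  # adversarial prompt that preserved its intent
        current = rewrite_prompt(current, feedback)  # semantically aligned rewrite
    return None  # gave up within the iteration budget
```

In practice the feedback signal might be a filter's block decision or a moderation score, and the rewriter is typically another LLM prompted to preserve semantics while changing surface form.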
Multimodal Large Language Models (MLLMs) used in Vision-and-Language Navigation (VLN) systems are vulnerable to jailbreak attacks. Adversarially crafted natural language instructions, even when disguised within seemingly benign prompts, can bypass safety mechanisms and cause the VLN agent to perform unintended or harmful actions in both simulated and real-world environments. The attacks exploit the MLLM's tendency to follow instructions without sufficiently weighing the consequences of the resulting actions.
Multimodal Large Language Models (MLLMs) are vulnerable to implicit jailbreak attacks that use least significant bit (LSB) steganography to conceal malicious instructions within images. The hidden instructions are paired with seemingly benign, image-related text prompts that lead the MLLM to decode and execute them. The attack bypasses existing safety mechanisms by exploiting the model's cross-modal reasoning capabilities.
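LSB steganography itself is a standard technique. The sketch below, assuming NumPy and Pillow with illustrative function and file names, shows how a text payload can be written into and read back from the least significant bits of a cover image, which is the channel this class of attack relies on.

```python
# Minimal LSB steganography sketch (illustrative, not the paper's code).
import numpy as np
from PIL import Image

def embed_lsb(cover_path: str, message: str, out_path: str) -> None:
    """Hide a UTF-8 message in the least significant bits of an RGB image."""
    img = np.array(Image.open(cover_path).convert("RGB"), dtype=np.uint8)
    # NUL-terminate the payload, then expand it into individual bits.
    bits = np.unpackbits(np.frombuffer(message.encode("utf-8") + b"\x00", dtype=np.uint8))
    flat = img.flatten()
    if bits.size > flat.size:
        raise ValueError("message too long for cover image")
    flat[: bits.size] = (flat[: bits.size] & 0xFE) | bits  # overwrite LSBs only
    # Save losslessly; a lossy format would corrupt the hidden bits.
    Image.fromarray(flat.reshape(img.shape)).save(out_path, format="PNG")

def extract_lsb(stego_path: str) -> str:
    """Recover the hidden message by reading LSBs up to the NUL terminator."""
    flat = np.array(Image.open(stego_path).convert("RGB"), dtype=np.uint8).flatten()
    data = np.packbits(flat & 1).tobytes()
    return data.split(b"\x00", 1)[0].decode("utf-8", errors="ignore")
```

Because the payload lives in pixel LSBs, it survives lossless formats such as PNG but is typically destroyed by lossy re-encoding, which is one reason image re-compression is sometimes suggested as a mitigation.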
Large Language Model (LLM)-based Multi-Agent Systems (MAS) are vulnerable to intellectual property (IP) leakage attacks. An attacker with black-box access (interacting only through the public API) can craft adversarial queries that propagate through the MAS and extract sensitive information such as system prompts, task instructions, tool specifications, the number of agents, and the system topology.
Multi-Agent Debate (MAD) frameworks leveraging Large Language Models (LLMs) are vulnerable to amplified jailbreak attacks. A novel structured prompt-rewriting technique exploits the iterative dialogue and role-playing dynamics of MAD, circumventing inherent safety mechanisms and significantly increasing the likelihood of generating harmful content. The attack succeeds by using narrative encapsulation, role-driven escalation, iterative refinement, and rhetorical obfuscation to guide agents towards progressively elaborating harmful responses.
Multilingual and multi-accent audio inputs, combined with acoustic adversarial perturbations (reverberation, echo, whisper effects), can bypass safety mechanisms in Large Audio Language Models (LALMs), causing them to generate unsafe or harmful outputs. The vulnerability is amplified by the interaction between acoustic and linguistic variations, particularly in languages with less training data.
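The acoustic perturbations involved are ordinary signal-processing operations. Below is a minimal sketch of reverberation and echo, assuming a mono floating-point waveform at 16 kHz; the parameter values are illustrative assumptions, not the paper's settings.

```python
# Illustrative acoustic perturbations: synthetic reverb and echo.
# Assumes a mono float waveform in [-1, 1]; parameters are not the paper's.
import numpy as np

def add_reverb(audio: np.ndarray, sr: int = 16000, decay: float = 0.5) -> np.ndarray:
    """Convolve with a short synthetic, exponentially decaying impulse response."""
    ir_len = int(0.3 * sr)  # 300 ms reverberant tail
    ir = np.random.randn(ir_len) * np.exp(-decay * 20 * np.arange(ir_len) / sr)
    ir[0] = 1.0  # keep the direct path dominant
    wet = np.convolve(audio, ir)[: len(audio)]
    return wet / (np.max(np.abs(wet)) + 1e-9)

def add_echo(audio: np.ndarray, sr: int = 16000, delay_s: float = 0.25, gain: float = 0.4) -> np.ndarray:
    """Mix in a single delayed, attenuated copy of the signal."""
    d = min(int(delay_s * sr), len(audio))
    out = audio.copy()
    out[d:] += gain * audio[: len(audio) - d]
    return out / (np.max(np.abs(out)) + 1e-9)
```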
Multimodal Large Language Models (MLLMs) are vulnerable to a jailbreaking attack, dubbed PiCo, that leverages token-level typographic attacks on images embedded within code-style instructions. The attack bypasses multi-tiered defense mechanisms, including input filtering and runtime monitoring, by exploiting weaknesses in the visual modality's integration with programming contexts. Harmful intent is concealed within visually benign image fragments and code instructions, circumventing safety protocols.
A vulnerability in text-to-image (T2I) models allows bypassing safety filters through the use of metaphor-based adversarial prompts. These prompts, crafted using LLMs, indirectly convey sensitive content, exploiting the model's ability to infer meaning from figurative language while circumventing explicit keyword filters and model editing strategies.
Multimodal Large Language Models (MLLMs) are vulnerable to a novel attack vector leveraging narrative-driven visual storytelling and role immersion to circumvent built-in safety mechanisms. The attack, termed MIRAGE, decomposes harmful queries into environment, character, and activity triplets, generating a sequence of images and text prompts that guide the MLLM through a deceptive narrative, ultimately eliciting harmful responses. The attack successfully exploits the MLLM's cross-modal reasoning abilities and susceptibility to persona-based manipulation.