Vulnerabilities in multimodal AI systems
A vulnerability in SpeechGPT allows safety filters to be bypassed with adversarial audio prompts crafted by a token-level attack. The attack is white-box only with respect to SpeechGPT's speech tokenization: the attacker uses knowledge of the tokenizer to generate adversarial discrete token sequences, which are then synthesized into audio. These audio prompts elicit restricted or harmful outputs that the model would normally suppress. The attack's effectiveness relies on the model's discrete audio token representation and does not require access to model parameters or gradients.
GhostPrompt demonstrates a vulnerability in the multimodal safety filters used with text-to-image generative models. A dynamic prompt optimization framework iteratively generates adversarial prompts that evade both text-level and image-level safety checks while preserving the prompt's original harmful intent. The bypass combines semantically aligned prompt rewriting with the injection of benign visual cues that confuse image-level filters.
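The core of such a filter-evasion loop can be shown as a short sketch. The `rewrite_prompt` and `passes_filters` callables below are hypothetical stand-ins for an LLM-based rewriter and the target pipeline's text and image safety checks; they are not GhostPrompt's actual components, and the benign-visual-cue injection step is omitted.

```python
# Minimal sketch of an iterative filter-evasion loop of the kind described
# above. `rewrite_prompt` and `passes_filters` are hypothetical stand-ins,
# not GhostPrompt's real components.
from typing import Callable, Optional, Tuple

def optimize_prompt(
    prompt: str,
    rewrite_prompt: Callable[[str, str], str],          # (current prompt, filter feedback) -> rewrite
    passes_filters: Callable[[str], Tuple[bool, str]],  # prompt -> (allowed?, feedback)
    max_iters: int = 20,
) -> Optional[str]:
    """Iteratively rewrite a prompt until it slips past the safety checks."""
    current = prompt
    for _ in range(max_iters):
        allowed, feedback = passes_filters(current)
        if allowed:
            return current  # adversarial prompt that preserved its intent
        current = rewrite_prompt(current, feedback)  # semantically aligned rewrite
    return None  # gave up within the iteration budget
```

In practice the feedback signal might be a filter's block decision or a moderation score, and the rewriter is typically another LLM prompted to preserve semantics while changing surface form.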
Multimodal Large Language Models (MLLMs) used in Vision-and-Language Navigation (VLN) systems are vulnerable to jailbreak attacks. Adversarially crafted natural language instructions, even when disguised within seemingly benign prompts, can bypass safety mechanisms and cause the VLN agent to perform unintended or harmful actions in both simulated and real-world environments. The attacks exploit the MLLM's tendency to follow instructions without sufficiently weighing the consequences of the resulting actions.
Multimodal Large Language Models (MLLMs) are vulnerable to implicit jailbreak attacks that use least significant bit (LSB) steganography to conceal malicious instructions within images. The hidden instructions are paired with seemingly benign, image-related text prompts that lead the MLLM to decode and execute them. The attack bypasses existing safety mechanisms by exploiting the model's cross-modal reasoning capabilities.
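LSB steganography itself is a standard technique. The sketch below, assuming NumPy and Pillow with illustrative function and file names, shows how a text payload can be written into and read back from the least significant bits of a cover image, which is the channel this class of attack relies on.

```python
# Minimal LSB steganography sketch (illustrative, not the paper's code).
import numpy as np
from PIL import Image

def embed_lsb(cover_path: str, message: str, out_path: str) -> None:
    """Hide a UTF-8 message in the least significant bits of an RGB image."""
    img = np.array(Image.open(cover_path).convert("RGB"), dtype=np.uint8)
    # NUL-terminate the payload, then expand it into individual bits.
    bits = np.unpackbits(np.frombuffer(message.encode("utf-8") + b"\x00", dtype=np.uint8))
    flat = img.flatten()
    if bits.size > flat.size:
        raise ValueError("message too long for cover image")
    flat[: bits.size] = (flat[: bits.size] & 0xFE) | bits  # overwrite LSBs only
    # Save losslessly; a lossy format would corrupt the hidden bits.
    Image.fromarray(flat.reshape(img.shape)).save(out_path, format="PNG")

def extract_lsb(stego_path: str) -> str:
    """Recover the hidden message by reading LSBs up to the NUL terminator."""
    flat = np.array(Image.open(stego_path).convert("RGB"), dtype=np.uint8).flatten()
    data = np.packbits(flat & 1).tobytes()
    return data.split(b"\x00", 1)[0].decode("utf-8", errors="ignore")
```

Because the payload lives in pixel LSBs, it survives lossless formats such as PNG but is typically destroyed by lossy re-encoding, which is one reason image re-compression is sometimes suggested as a mitigation.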
Large Language Model (LLM)-based Multi-Agent Systems (MAS) are vulnerable to intellectual property (IP) leakage attacks. An attacker with black-box access (interacting only through the public API) can craft adversarial queries that propagate through the MAS and extract sensitive information such as system prompts, task instructions, tool specifications, the number of agents, and the system topology.
Multi-Agent Debate (MAD) frameworks leveraging Large Language Models (LLMs) are vulnerable to amplified jailbreak attacks. A novel structured prompt-rewriting technique exploits the iterative dialogue and role-playing dynamics of MAD, circumventing inherent safety mechanisms and significantly increasing the likelihood of generating harmful content. The attack succeeds by using narrative encapsulation, role-driven escalation, iterative refinement, and rhetorical obfuscation to guide agents towards progressively elaborating harmful responses.
Multilingual and multi-accent audio inputs, combined with acoustic adversarial perturbations (reverberation, echo, whisper effects), can bypass safety mechanisms in Large Audio Language Models (LALMs), causing them to generate unsafe or harmful outputs. The vulnerability is amplified by the interaction between acoustic and linguistic variations, particularly in languages with less training data.
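The acoustic perturbations involved are ordinary signal-processing operations. Below is a minimal sketch of reverberation and echo, assuming a mono floating-point waveform at 16 kHz; the parameter values are illustrative assumptions, not the paper's settings.

```python
# Illustrative acoustic perturbations: synthetic reverb and echo.
# Assumes a mono float waveform in [-1, 1]; parameters are not the paper's.
import numpy as np

def add_reverb(audio: np.ndarray, sr: int = 16000, decay: float = 0.5) -> np.ndarray:
    """Convolve with a short synthetic, exponentially decaying impulse response."""
    ir_len = int(0.3 * sr)  # 300 ms reverberant tail
    ir = np.random.randn(ir_len) * np.exp(-decay * 20 * np.arange(ir_len) / sr)
    ir[0] = 1.0  # keep the direct path dominant
    wet = np.convolve(audio, ir)[: len(audio)]
    return wet / (np.max(np.abs(wet)) + 1e-9)

def add_echo(audio: np.ndarray, sr: int = 16000, delay_s: float = 0.25, gain: float = 0.4) -> np.ndarray:
    """Mix in a single delayed, attenuated copy of the signal."""
    d = min(int(delay_s * sr), len(audio))
    out = audio.copy()
    out[d:] += gain * audio[: len(audio) - d]
    return out / (np.max(np.abs(out)) + 1e-9)
```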
Multimodal Large Language Models (MLLMs) are vulnerable to a jailbreaking attack, dubbed PiCo, that leverages token-level typographic attacks on images embedded within code-style instructions. The attack bypasses multi-tiered defense mechanisms, including input filtering and runtime monitoring, by exploiting weaknesses in the visual modality's integration with programming contexts. Harmful intent is concealed within visually benign image fragments and code instructions, circumventing safety protocols.
A vulnerability in text-to-image (T2I) models allows bypassing safety filters through the use of metaphor-based adversarial prompts. These prompts, crafted using LLMs, indirectly convey sensitive content, exploiting the model's ability to infer meaning from figurative language while circumventing explicit keyword filters and model editing strategies.
Multimodal Large Language Models (MLLMs) are vulnerable to a novel attack vector leveraging narrative-driven visual storytelling and role immersion to circumvent built-in safety mechanisms. The attack, termed MIRAGE, decomposes harmful queries into environment, character, and activity triplets, generating a sequence of images and text prompts that guide the MLLM through a deceptive narrative, ultimately eliciting harmful responses. The attack successfully exploits the MLLM's cross-modal reasoning abilities and susceptibility to persona-based manipulation.