Security issues in vision-language models
GhostPrompt demonstrates a vulnerability in the multimodal safety filters used with text-to-image generative models. An attacker can bypass these filters with a dynamic prompt optimization framework that iteratively generates adversarial prompts, evading both text-based and image-based safety checks while preserving the prompt's original, harmful intent. The bypass combines semantically aligned prompt rewriting with the injection of benign visual cues that confuse image-level filters.
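At its core this is a feedback loop between a prompt rewriter and the pipeline's safety checks. The sketch below is a minimal illustration of that loop, not GhostPrompt's implementation: the blocklist filter, image filter, generator, and rewriter are all trivial hypothetical stand-ins.

```python
# Illustrative skeleton of an iterative, filter-aware prompt-rewriting loop.
# Every component here is a toy stand-in for the real pipeline.

BLOCKLIST = {"blocked_term"}                 # placeholder text-level safety filter

def text_filter(prompt: str) -> bool:
    """Return True if the prompt passes the (toy) text-level check."""
    return not any(term in prompt.lower() for term in BLOCKLIST)

def image_filter(image) -> bool:
    """Placeholder image-level check; always passes in this sketch."""
    return True

def generate_image(prompt: str):
    """Placeholder for the text-to-image model call."""
    return f"<image for: {prompt}>"

def rewrite_prompt(prompt: str, feedback: str) -> str:
    """Placeholder for an LLM-based, semantics-preserving rewriter."""
    return prompt.replace("blocked_term", "synonym")

def optimize_prompt(seed_prompt: str, max_iters: int = 20):
    """Rewrite until the prompt clears both checks, or give up after max_iters."""
    candidate = seed_prompt
    for _ in range(max_iters):
        if not text_filter(candidate):
            candidate = rewrite_prompt(candidate, feedback="text filter rejection")
            continue
        image = generate_image(candidate)
        if not image_filter(image):
            candidate = rewrite_prompt(candidate, feedback="image filter rejection")
            continue
        return candidate                      # passed both checks
    return None                               # budget exhausted

result = optimize_prompt("a prompt containing blocked_term")
```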
Multimodal large language models (MLLMs) are vulnerable to implicit jailbreak attacks that use least significant bit (LSB) steganography to conceal malicious instructions within images. The hidden instructions are paired with seemingly benign, image-related text prompts that lead the MLLM to execute them. The attack bypasses existing safety mechanisms by exploiting the model's cross-modal reasoning capabilities.
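LSB embedding itself is a standard steganography technique. The sketch below shows how a short text payload can be written into and read back from an image's least significant bits with NumPy and Pillow; it illustrates only the carrier mechanism, with a benign placeholder string, not the attack pipeline around it.

```python
import numpy as np
from PIL import Image

def embed_lsb(image: Image.Image, message: str) -> Image.Image:
    """Hide a UTF-8 message in the least significant bits of an RGB image."""
    pixels = np.array(image.convert("RGB"), dtype=np.uint8)
    payload = message.encode("utf-8") + b"\x00"              # null byte marks the end
    bits = np.unpackbits(np.frombuffer(payload, dtype=np.uint8))
    flat = pixels.flatten()
    if bits.size > flat.size:
        raise ValueError("message too long for this image")
    flat[:bits.size] = (flat[:bits.size] & 0xFE) | bits      # overwrite the LSBs
    return Image.fromarray(flat.reshape(pixels.shape))

def extract_lsb(image: Image.Image) -> str:
    """Recover a null-terminated UTF-8 message from an image's LSBs."""
    flat = np.array(image.convert("RGB"), dtype=np.uint8).flatten()
    payload = np.packbits(flat & 1).tobytes()
    return payload.split(b"\x00", 1)[0].decode("utf-8")

cover = Image.new("RGB", (64, 64), color=(200, 180, 160))
stego = embed_lsb(cover, "placeholder hidden text")
assert extract_lsb(stego) == "placeholder hidden text"
```

Because only the lowest bit of each channel changes, the stego image is visually indistinguishable from the cover image.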
A vulnerability in text-to-image (T2I) models allows bypassing safety filters through the use of metaphor-based adversarial prompts. These prompts, crafted using LLMs, indirectly convey sensitive content, exploiting the model's ability to infer meaning from figurative language while circumventing explicit keyword filters and model editing strategies.
Multimodal Large Language Models (MLLMs) are vulnerable to the Jailbreak-Probability-based Attack (JPA). JPA uses a Jailbreak Probability Prediction Network (JPPN) to identify and optimize adversarial perturbations of input images, maximizing the probability of eliciting harmful responses from the MLLM even with small perturbation bounds and few iterations. The attack perturbs the input image so that its hidden states within the MLLM shift toward a higher predicted jailbreak probability.
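The optimization follows the familiar bounded gradient-ascent pattern, with a predicted jailbreak probability in place of a classification loss. The PyTorch sketch below is schematic only: the JPPN stand-in is an untrained toy network over raw pixels rather than a model over the MLLM's hidden states, and the bounds and step counts are illustrative.

```python
import torch
import torch.nn as nn

class ToyJPPN(nn.Module):
    """Untrained stand-in for the Jailbreak Probability Prediction Network (JPPN)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.net(x))             # predicted jailbreak probability

def perturb_image(image, jppn, eps=8 / 255, steps=10, step_size=2 / 255):
    """PGD-style ascent on the predicted jailbreak probability within an L_inf ball."""
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        prob = jppn(image + delta).mean()
        prob.backward()
        with torch.no_grad():
            delta += step_size * delta.grad.sign()     # step toward higher probability
            delta.clamp_(-eps, eps)                    # respect the perturbation bound
            delta.copy_((image + delta).clamp(0, 1) - image)  # keep pixels in [0, 1]
        delta.grad.zero_()
    return (image + delta).clamp(0, 1).detach()

image = torch.rand(1, 3, 32, 32)                       # placeholder image tensor in [0, 1]
adversarial = perturb_image(image, ToyJPPN())
```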
FC-Attack jailbreaks Large Vision-Language Models (LVLMs) with automatically generated flowcharts whose step-by-step descriptions are derived or rephrased from harmful queries, combined with a benign textual prompt. The vulnerability lies in the models' susceptibility to visual prompts: the harmful step-by-step content embedded in the flowcharts bypasses the safety alignment mechanisms.
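The visual carrier is just a rendered flowchart. The Pillow sketch below renders a list of step descriptions as a simple top-to-bottom chart using benign placeholder steps; it shows the rendering step only, and the exact chart styles used in the paper are not reproduced here.

```python
from PIL import Image, ImageDraw

def render_flowchart(steps: list[str], box_w: int = 420, box_h: int = 60,
                     gap: int = 40, margin: int = 20) -> Image.Image:
    """Render numbered step descriptions as a simple top-to-bottom flowchart."""
    height = 2 * margin + len(steps) * box_h + (len(steps) - 1) * gap
    img = Image.new("RGB", (box_w + 2 * margin, height), "white")
    draw = ImageDraw.Draw(img)
    for i, step in enumerate(steps):
        top = margin + i * (box_h + gap)
        draw.rectangle([margin, top, margin + box_w, top + box_h],
                       outline="black", width=2)
        draw.text((margin + 10, top + box_h // 3), f"{i + 1}. {step}", fill="black")
        if i < len(steps) - 1:                         # arrow down to the next box
            cx = margin + box_w // 2
            draw.line([cx, top + box_h, cx, top + box_h + gap], fill="black", width=2)
            draw.polygon([(cx - 5, top + box_h + gap - 8),
                          (cx + 5, top + box_h + gap - 8),
                          (cx, top + box_h + gap)], fill="black")
    return img

chart = render_flowchart(["Placeholder step one",
                          "Placeholder step two",
                          "Placeholder step three"])
chart.save("flowchart.png")
```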
A bimodal adversarial attack, PBI-Attack, can manipulate Large Vision-Language Models (LVLMs) into generating toxic or harmful content by iteratively optimizing both textual and visual inputs in a black-box setting. The attack leverages a surrogate LVLM to inject malicious features from a harmful corpus into a benign image, then iteratively refines both image and text perturbations to maximize the toxicity of the model’s output as measured by a toxicity detection model (Perspective API or Detoxify).
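The loop's objective is simply the toxicity of whatever the target model returns. Detoxify, one of the two scorers named above, exposes that score directly; the sketch below shows only the scoring side of such a loop, with a hypothetical query_lvlm stand-in for the black-box target and illustrative prompt variants.

```python
from detoxify import Detoxify

scorer = Detoxify("original")                      # open-source toxicity classifier

def query_lvlm(image_path: str, prompt: str) -> str:
    """Hypothetical stand-in for the black-box LVLM under attack."""
    return "placeholder model response"

def toxicity_reward(image_path: str, prompt: str) -> float:
    """Score the target model's response; the attack iteratively maximizes this."""
    response = query_lvlm(image_path, prompt)
    return float(scorer.predict(response)["toxicity"])

candidates = ["prompt variant A", "prompt variant B"]
scores = {p: toxicity_reward("candidate.png", p) for p in candidates}
best = max(scores, key=scores.get)                 # keep the highest-scoring variant
```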
The Antelope attack exploits vulnerabilities in Text-to-Image (T2I) models' safety filters by crafting adversarial prompts. These prompts, while appearing benign, induce the generation of NSFW images by leveraging semantic similarity between harmless and harmful concepts. The attack replaces explicit terms in an original prompt with seemingly innocuous alternatives and appends carefully selected suffix tokens. This manipulation bypasses both text-based and image-based filters and generates sensitive content while maintaining close semantic alignment with the original intent, which helps it evade detection.
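The semantic-alignment constraint is typically measured in a joint text embedding space such as CLIP's. The sketch below checks how closely a candidate rewrite tracks an original prompt using the sentence-transformers CLIP text encoder; this shows the alignment measurement only, not the suffix-selection procedure, and the model name and example prompts are illustrative.

```python
from sentence_transformers import SentenceTransformer, util

# CLIP text embeddings as a proxy for semantic alignment between an original
# prompt and a candidate rewrite. Model choice and example prompts are illustrative.
model = SentenceTransformer("clip-ViT-B-32")

def semantic_alignment(original: str, rewrite: str) -> float:
    """Cosine similarity between the CLIP text embeddings of two prompts."""
    emb = model.encode([original, rewrite], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))

score = semantic_alignment("a photo of a cat on a sofa",
                           "a feline resting on living-room furniture")
print(f"cosine similarity: {score:.3f}")           # higher means closer in meaning
```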
Several Large Vision-Language Models (LVLMs) share a vulnerability in which a seemingly safe image, combined with additional safe images and prompts under a specific attack methodology (Safety Snowball Agent), can trigger the generation of unsafe and harmful content. The vulnerability exploits the models' universal reasoning abilities and a "safety snowball effect," in which an initial unsafe response leads to progressively more harmful outputs.
Large Vision-Language Models (VLMs) are vulnerable to a novel black-box jailbreak attack, IDEATOR, which leverages a separate VLM to generate malicious image-text pairs. The attacker VLM iteratively refines its prompts based on the target VLM's responses, bypassing safety mechanisms by generating contextually relevant and visually subtle malicious prompts.
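Structurally, IDEATOR is an automated red-teaming loop between two models. The sketch below shows only that control flow; attacker_step, target_vlm, and judge are hypothetical placeholders, and the actual prompt generation, image synthesis, and scoring are not reproduced.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    image_description: str    # the attacker VLM's plan for the visual component
    text_prompt: str          # the accompanying textual prompt

def attacker_step(goal: str, history: list) -> Candidate:
    """Placeholder: the attacker VLM refines its next image-text pair from past responses."""
    return Candidate(image_description=f"benign-looking scene (revision {len(history)})",
                     text_prompt="placeholder question about the image")

def target_vlm(candidate: Candidate) -> str:
    """Placeholder for the black-box target VLM."""
    return "placeholder response"

def judge(goal: str, response: str) -> float:
    """Placeholder scorer in [0, 1]; 1.0 would mean the goal was fully achieved."""
    return 0.0

def red_team(goal: str, max_turns: int = 10, threshold: float = 0.8):
    """Iteratively propose image-text pairs until the judge score crosses the threshold."""
    history = []
    for _ in range(max_turns):
        candidate = attacker_step(goal, history)
        response = target_vlm(candidate)
        history.append((candidate, response))
        if judge(goal, response) >= threshold:
            return candidate, response             # successful jailbreak found
    return None                                    # turn budget exhausted
```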