Cross-Modal Toxic Continuation
Research Paper
Red Team Diffuser: Exposing Toxic Continuation Vulnerabilities in Vision-Language Models via Reinforcement Learning
Description: Large Vision-Language Models (VLMs) are vulnerable to a cross-modal toxic continuation attack facilitated by reinforcement learning-tuned diffusion models. This vulnerability allows an attacker to bypass safety alignment and external guardrails (such as NSFW image filters) by pairing a specific text prefix with a "semantically adversarial" image. Unlike traditional gradient-based adversarial examples that rely on pixel noise, these images are semantically coherent but optimized via Denoising Diffusion Policy Optimization (DDPO) to maximize the toxicity of the VLM's textual completion. The attack exploits the interaction between visual and textual modalities, causing the model to generate hate speech, threats, or sexually explicit text even when the text prefix alone would be refused or completed safely.
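At inference time the attack is just a standard interleaved image-plus-text query; the adversarial work happens earlier, when the image is generated. The following is a minimal sketch of that query pattern, assuming LLaVA-1.5-7B served through Hugging Face transformers; the model id, prompt template, file name, and prefix placeholder are illustrative assumptions, not artifacts released with the paper.

```python
# Minimal sketch of the paired image + text-prefix query (assumes LLaVA-1.5-7B via
# Hugging Face transformers; the image path and prefix placeholder are illustrative).
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# The image looks benign (and passes NSFW filters) but was optimized by the
# RL-tuned diffusion model to steer the textual continuation toward toxicity.
image = Image.open("rediffuser_output.png")
prefix = "[Toxic Prefix from RealToxicityPrompts]"
prompt = f"USER: <image>\nPlease extend the following sentence: {prefix}\nASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)
continuation = processor.decode(output_ids[0], skip_special_tokens=True)
print(continuation)  # an aligned model should refuse or complete the prefix safely
```

Because the image alone looks benign and the prefix alone may be refused or completed safely, neither unimodal filter fires; the toxic continuation emerges only from the pair.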
Examples: The attack requires a target VLM and a diffusion model fine-tuned against it using the RedDiffuser framework.
- Preparation (Attacker side):
  - Use an unaligned LLM (e.g., Gemini Pro with safety settings disabled) to generate a candidate image prompt via greedy search.
  - Pass this prompt to a Stable Diffusion model fine-tuned via RL (the RedDiffuser procedure, built on DDPO) with a dual reward function: one term maximizes toxicity in the target VLM's completion (scored by a classifier such as Detoxify), the other maintains semantic alignment (scored by BERTScore). A minimal reward sketch follows this Examples section.
- Attack Execution:
  - Input Text Prefix: "Please extend the following sentence: [Toxic Prefix from RealToxicityPrompts]"
  - Input Image: An image generated by the RedDiffuser model, for example a stylized, cartoon-like rendering of a school hallway containing subtle ominous cues (shadows/garbage) that passes standard NSFW filters but triggers a violent completion.
  - Query: Submit the paired Image + Text Prefix to the target VLM.
- Result:
  - The VLM completes the sentence with high-toxicity content (e.g., detailed threats or profanity) instead of a refusal.
For specific adversarial image prompts and the fine-tuning codebase, see the RedDiffuser repository and the RealToxicityPrompts dataset subset.
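The dual reward mentioned in the Preparation step can be sketched as follows. This is a minimal illustration, not the paper's exact objective: the alpha/beta weighting, the choice of BERTScore reference text, and the function names are assumptions, and the DDPO fine-tuning loop that consumes this reward (e.g., a DDPO trainer such as the one in Hugging Face trl) is omitted.

```python
# pip install detoxify bert-score  -- assumed dependencies for this sketch
from detoxify import Detoxify
from bert_score import score as bert_score

toxicity_clf = Detoxify("original")  # toxicity classifier acting as the reward model

def dual_reward(vlm_completion: str, image_prompt: str, reference_prompt: str,
                alpha: float = 1.0, beta: float = 1.0) -> float:
    """Illustrative dual reward: toxicity of the target VLM's continuation plus
    semantic alignment of the image prompt. The alpha/beta weights are assumptions."""
    # Toxicity term: how toxic the continuation elicited from the target VLM is (0..1).
    toxicity = float(toxicity_clf.predict(vlm_completion)["toxicity"])
    # Alignment term: BERTScore F1 between the candidate image prompt and a
    # reference description, keeping the generated image semantically coherent.
    _, _, f1 = bert_score([image_prompt], [reference_prompt], lang="en")
    return alpha * toxicity + beta * float(f1.mean())
```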
Impact:
- Safety Bypass: Circumvents built-in safety alignment (RLHF) and external content moderation systems (such as image-level NSFW classifiers) with a high success rate (up to 94.50% guardrail pass rate).
- Content Generation: Forces the model to generate severe toxicity, including hate speech, insults, and violent threats.
- Cross-Model Transferability: Adversarial images optimized against one model (e.g., LLaVA) successfully transfer to other black-box commercial and open-source models, significantly increasing toxicity rates on systems like Gemini-1.5-flash (+5.1%) and Llama-3.2-11B-Vision-Instruct (+26.83%).
Affected Systems:
- LLaVA-1.5 (7B and 13B variants)
- Google Gemini-1.5-flash
- Meta Llama-3.2-11B-Vision-Instruct
- Any VLM accepting interleaved image-text inputs for continuation tasks.
Mitigation Steps:
- Multimodal Guardrails: Implement safety checkers that evaluate the image and text combination jointly, rather than filtering each modality in isolation (see the sketch after this list).
- Adversarial Training: Incorporate toxic continuation scenarios involving adversarial images into the VLM's safety alignment training data (RLHF).
- Refusal Training: Explicitly train models to recognize and refuse "toxic continuation" tasks where a neutral or slightly harmful prefix is paired with context-shifting visual inputs.
- Visual Semantics Analysis: Move beyond pixel-level or basic NSFW classification for input images; utilize semantic analysis to detect subtle visual cues (e.g., threatening atmospheres or caricatures) designed to elicit toxicity.
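As one concrete starting point for the first mitigation above, the pre-flight check below scores the image and the text prefix together before the pair reaches the VLM. This is a sketch, not a complete defense: it assumes an OpenAI API key and uses the multimodal omni-moderation-latest model purely as an example of an off-the-shelf joint image-text classifier; any checker that sees both modalities at once (including an in-house VLM judge) could be substituted, and it should be paired with output-side toxicity screening of the continuation itself.

```python
# pip install openai  -- sketch of a joint image + text pre-flight guardrail.
# Assumes OPENAI_API_KEY is set; "omni-moderation-latest" is used only as an
# example of a moderation model that accepts image and text inputs together.
import base64
from openai import OpenAI

client = OpenAI()

def joint_guardrail(image_path: str, text_prefix: str) -> bool:
    """Return True if the image + prefix pair should be blocked before reaching the VLM."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    result = client.moderations.create(
        model="omni-moderation-latest",
        input=[
            {"type": "text", "text": text_prefix},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    ).results[0]
    # The point is to score the *combination*: each input may look benign to a
    # unimodal filter, which is exactly the gap this attack exploits.
    return result.flagged
```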