LLM-Tuned Image Jailbreak
Research Paper
Jailbreaking Safeguarded Text-to-Image Models via Large Language Models
Description: A vulnerability in safeguarded text-to-image models allows safety filters and alignment methods to be bypassed using adversarial prompts generated by a fine-tuned large language model (LLM). The attack, termed PromptTune, rewrites unsafe prompts into semantically similar adversarial prompts that evade these safety mechanisms, resulting in the generation of harmful images. Notably, the attack does not require repeated queries to the target text-to-image model.
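The description above amounts to a query-free rewrite-then-generate pattern. The sketch below illustrates that flow only and is not the paper's implementation; the checkpoint name `prompt-tuner-checkpoint` and the commented `target_t2i_pipeline` call are hypothetical placeholders.

```python
# Minimal sketch of the rewrite-then-generate flow described above.
# "prompt-tuner-checkpoint" is a hypothetical fine-tuned rewriter LLM,
# not an artifact released with the paper.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "prompt-tuner-checkpoint"  # hypothetical checkpoint name
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

def rewrite_prompt(unsafe_prompt: str) -> str:
    """Rewrite a prompt into a semantically similar adversarial variant."""
    inputs = tokenizer(unsafe_prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=77, do_sample=True)
    # Decode only the newly generated tokens, i.e. the rewritten prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

# Query-free with respect to the target: the rewritten prompt is submitted
# once, with no iterative feedback loop against the text-to-image model.
adversarial_prompt = rewrite_prompt("<unsafe prompt>")
# image = target_t2i_pipeline(adversarial_prompt)  # hypothetical single call
```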
Examples: The paper gives examples of prompts that successfully bypass several safety mechanisms (keyword filtering, text-embedding filtering, image-embedding filtering, MACE, and SafeGen); see arXiv:2503.01839. For instance, images depicting nudity are generated by substituting euphemisms for the explicit keywords that would trigger the safety filters.
Impact: Successful exploitation lets attackers generate harmful images (e.g., depicting nudity, violence, or hate speech) from safeguarded text-to-image models despite their safety mechanisms, undermining the models' intended safety guarantees and enabling the creation and dissemination of harmful content.
Affected Systems: Safeguarded text-to-image models that rely on safety filters and/or alignment methods are vulnerable, particularly those using CLIP for image-text similarity assessment. The attack was demonstrated against Stable Diffusion XL Turbo and against models safeguarded with the MACE and SafeGen alignment techniques; specific model versions are not detailed in the paper.
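For context on the CLIP-based filters mentioned above, the following is a minimal sketch of an image-embedding safety check of the kind the attack evades, assuming a cosine-similarity test against a fixed list of unsafe concepts; the concept list and threshold are illustrative assumptions, not values from the paper.

```python
# Sketch of a CLIP image-embedding safety filter; concepts and threshold
# are assumptions for illustration only.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

UNSAFE_CONCEPTS = ["nudity", "graphic violence", "hate symbols"]  # assumed list
THRESHOLD = 0.25  # assumed cutoff; real filters tune this per concept

def is_unsafe(image: Image.Image) -> bool:
    """Flag the image if it embeds close to any unsafe concept."""
    inputs = processor(text=UNSAFE_CONCEPTS, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    image_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    sims = (image_emb @ text_emb.T).squeeze(0)  # cosine similarity per concept
    return bool((sims > THRESHOLD).any())
```

An adversarial prompt succeeds against a filter like this when the generated image stays below the threshold for every listed concept while remaining semantically close to the original unsafe request.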
Mitigation Steps:
- Improve safety filters to detect more sophisticated adversarial prompts.
- Develop more robust alignment methods that are less susceptible to prompt manipulation.
- Integrate an adversarial prompt generation model (such as PromptTune) into the model's training pipeline to enhance robustness against adversarial attacks.
- Implement additional layers of moderation and review for generated images to catch prompts that bypass the input filter (a layered check is sketched below).
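As a rough illustration of the last mitigation step, the sketch below layers a keyword pre-check on the prompt with the post-generation `is_unsafe` check from the earlier snippet; the blocklist, the `generate_fn` parameter, and the review queue are hypothetical placeholders, not components described in the paper.

```python
# Illustrative defense-in-depth wrapper; all names here are placeholders.
from typing import Callable, Optional
from PIL import Image

BLOCKLIST = {"nude", "gore"}  # assumed keyword filter; trivially evaded alone
review_queue: list = []       # stand-in for a human-review pipeline

def moderated_generate(prompt: str,
                       generate_fn: Callable[[str], Image.Image]) -> Optional[Image.Image]:
    # Layer 1: cheap keyword filter on the incoming prompt.
    if any(word in prompt.lower() for word in BLOCKLIST):
        return None
    image = generate_fn(prompt)  # call into the text-to-image model
    # Layer 2: image-embedding check (is_unsafe from the sketch above)
    # catches adversarial prompts that slipped past layer 1.
    if is_unsafe(image):
        review_queue.append((prompt, image))  # route to human review
        return None
    return image
```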