Automated T2I Jailbreak
Research Paper
Automatic Jailbreaking of the Text-to-Image Generative AI Systems
Description: Commercial text-to-image (T2I) generative AI systems are vulnerable to automated jailbreaking attacks that bypass their safety mechanisms and induce the generation of copyrighted material. An attacker can use an automated pipeline to generate prompts that precisely describe copyrighted images without naming them, circumventing keyword-based detection filters.
Examples: See the paper's repository at https://github.com/Kim-Minseon/APGP.git for code and examples of generated prompts. Illustrative examples from the paper show that by carefully crafting prompts, even systems with high initial block rates (e.g., ChatGPT blocking 84% of naive prompts) can be successfully "jailbroken" to generate copyrighted content in a significant percentage of cases (e.g., 76% for ChatGPT with the described attack). Specific example prompts are provided in the paper's Appendix B.3 and Table 4.
Impact: Unauthorized reproduction of copyrighted material, leading to copyright infringement and potential legal liabilities for the AI service provider and users. This compromises the integrity and trustworthiness of the AI systems.
Affected Systems: Commercial text-to-image generative AI systems, including but not limited to: ChatGPT, Copilot, Gemini, and Midjourney (as of May 2024; vulnerability may affect other systems as well).
Mitigation Steps:
- Implement more robust content filtering mechanisms that go beyond simple keyword matching. This may involve analyzing the semantic meaning of prompts and assessing the similarity of generated images to copyrighted material using more sophisticated techniques than simple cosine similarity.
- Develop techniques to detect and prevent the generation of near-exact replicas of images from the training dataset.
- Employ diverse, multi-layered security strategies that combine prompt filtering with post-generation image analysis.
- Regularly update safety filters and models based on ongoing research into adversarial attacks and vulnerabilities.
- Conduct thorough red-teaming exercises using advanced automated prompt generation techniques to identify and address vulnerabilities systematically.
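As a minimal sketch of the multi-layered filtering idea above, the snippet below chains a keyword denylist with a semantic-similarity gate over prompt embeddings. All names here are hypothetical, and the toy vectors stand in for embeddings that a real deployment would obtain from a text/image encoder (e.g., a CLIP-style model); this is an illustration of the layering pattern, not a production filter.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical denylist; real systems maintain far larger, curated lists.
BLOCKED_KEYWORDS = {"mickey mouse", "pikachu"}

def keyword_filter(prompt: str) -> bool:
    """Layer 1: simple substring match (the layer attackers evade)."""
    p = prompt.lower()
    return any(k in p for k in BLOCKED_KEYWORDS)

def similarity_gate(candidate_embedding, reference_embeddings,
                    threshold=0.9) -> bool:
    """Layer 2: flag prompts semantically close to known protected content,
    even when no blocked keyword appears."""
    return any(cosine_similarity(candidate_embedding, ref) >= threshold
               for ref in reference_embeddings)

def should_block(prompt, prompt_embedding, reference_prompt_embeddings) -> bool:
    """Block if either layer fires."""
    if keyword_filter(prompt):
        return True
    return similarity_gate(prompt_embedding, reference_prompt_embeddings)
```

The same `similarity_gate` pattern applies post-generation: embed the generated image and compare it against embeddings of copyrighted reference images, catching near-replicas that evaded prompt-side checks.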
© 2025 Promptfoo. All rights reserved.