Macaronic T2I Jailbreak
MacPrompt: Macaronic-guided Jailbreak against Text-to-Image Models
Description: Text-to-Image (T2I) models and their associated safety filters are vulnerable to MacPrompt, a black-box jailbreak technique that exploits cross-lingual embedding alignments. Attackers can bypass input text filters, latent-representation filters, and model-level concept-removal defenses by replacing sensitive keywords with "macaronic" substitutes: strings constructed by extracting and recombining character-level substrings from translations of the target word across multiple languages. Because the resulting strings are lexically obfuscated and exploit non-invertible tokenization, they evade text-based safety classifiers and keyword blacklists while still mapping to the target visual concepts in the model's embedding space.
Examples: To bypass filters preventing the generation of specific restricted or banned objects, an attacker replaces the sensitive English word in the prompt with a macaronic cross-lingual string that triggers the same visual semantics:
- Replacing "cat" with ktoucpttokatt
- Replacing "dog" with nikchocahund
- Replacing "car" with vocheav
- Replacing "bird" with ápjaro
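The construction described above can be sketched in a few lines. This is an illustrative simplification, not the paper's actual algorithm: the translation table and the prefix-based substring heuristic are assumptions made for demonstration, and MacPrompt presumably selects and interleaves substrings more carefully to land near the target concept in embedding space.

```python
def macaronic_substitute(translations, k=3):
    """Build a macaronic string by concatenating short character-level
    substrings taken from each translation of the target word.

    Note: using a fixed-length prefix from each translation is an
    assumed heuristic for illustration only.
    """
    return "".join(word[:k] for word in translations)

# Hypothetical multilingual translations of "cat"
# (Polish, French, German)
cat_translations = ["kot", "chat", "katze"]
print(macaronic_substitute(cat_translations))  # -> "kotchakat"
```

The resulting string contains no blacklisted English keyword, yet each fragment originates from a word that a multilingual text encoder may associate with the same visual concept, which is the alignment the attack exploits.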
Impact: Attackers can reliably generate policy-violating images, including sexually explicit content, violent imagery, and banned objects. The technique achieves up to a 100% bypass rate against blacklist filters and a 92% attack success rate against state-of-the-art concept removal defenses, effectively neutralizing both standard and advanced T2I content moderation layers.
Affected Systems:
- Stable Diffusion (v2.1)
- Concept removal and safety-tuned SD variants (ESD, SLD, FMN, SafeGen, DUO, EAP, PromptGuard, Latent Guard)
- Commercial T2I services including DALL·E 3 and Doubao
© 2026 Promptfoo. All rights reserved.