Macaronic T2I Jailbreak
MacPrompt: Macaronic-guided Jailbreak against Text-to-Image Models
Description: Text-to-Image (T2I) models and their associated safety filters are vulnerable to MacPrompt, a black-box jailbreak technique that exploits cross-lingual embedding alignments. Attackers can bypass input text filters, latent-representation filters, and model-level concept-removal defenses by replacing sensitive keywords with "macaronic" substitutes: strings constructed by extracting and recombining character-level substrings from translations of the target word across multiple languages. Because the resulting strings are lexically obfuscated and exploit non-invertible tokenization, they evade text-based safety classifiers and keyword blacklists while still mapping to the target visual concepts in the model's embedding space.
Examples: To bypass filters preventing the generation of specific restricted or banned objects, an attacker replaces the sensitive English word in the prompt with a macaronic cross-lingual string that triggers the same visual semantics:
- Replacing "cat" with ktoucpttokatt
- Replacing "dog" with nikchocahund
- Replacing "car" with vocheav
- Replacing "bird" with ápjaro
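The construction described above can be sketched in a few lines. This is an illustrative simplification, not the paper's actual algorithm: the translation table and the prefix-based substring heuristic are assumptions made for demonstration, and MacPrompt presumably selects and interleaves substrings more carefully to land near the target concept in embedding space.

```python
def macaronic_substitute(translations, k=3):
    """Build a macaronic string by concatenating short character-level
    substrings taken from each translation of the target word.

    Note: using a fixed-length prefix from each translation is an
    assumed heuristic for illustration only.
    """
    return "".join(word[:k] for word in translations)

# Hypothetical multilingual translations of "cat"
# (Polish, French, German)
cat_translations = ["kot", "chat", "katze"]
print(macaronic_substitute(cat_translations))  # -> "kotchakat"
```

The resulting string contains no blacklisted English keyword, yet each fragment originates from a word that a multilingual text encoder may associate with the same visual concept, which is the alignment the attack exploits.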
Impact: Attackers can reliably generate policy-violating images, including sexually explicit content, violent imagery, and banned objects. The technique achieves up to a 100% bypass rate against blacklist filters and a 92% attack success rate against state-of-the-art concept removal defenses, effectively neutralizing both standard and advanced T2I content moderation layers.
Affected Systems:
- Stable Diffusion (v2.1)
- Concept removal and safety-tuned SD variants (ESD, SLD, FMN, SafeGen, DUO, EAP, PromptGuard, Latent Guard)
- Commercial T2I services including DALL·E 3 and Doubao
© 2026 Promptfoo. All rights reserved.