Distal Translation Jailbreak
Research Paper
: Politically Controversial Content Generation via Jailbreaking Attacks on GPT-based Text-to-Image Models
View PaperDescription: A vulnerability in the prompt-side safety filters of GPT-based Text-to-Image (T2I) systems allows attackers to bypass restrictions on Politically Sensitive Content (PSC). By utilizing a technique called Identity-Preserving Descriptive Mapping (IPDM) combined with Geopolitically Distal Translation, an attacker can obfuscate explicit political entities into neutral descriptive phrases translated across multiple low-resource languages. This induces semantic fragmentation, preventing the safety pre-filter from detecting the toxic relationship between the entities. However, the translated descriptions still provide sufficient cues for the backend image generation model to accurately reconstruct the identities, resulting in the successful synthesis of photorealistic, policy-violating images of real public figures.
Examples: To generate a prohibited image of a political figure associated with extremist symbolism, an attacker fragments the prompt into multi-lingual indexed descriptions:
- Identity-Preserving Descriptive Mapping (IPDM): Replace "Donald Trump" with a neutral, quiz-style description (e.g., "a New York born entrepreneur..."). Replace "Al-Qaeda flag" with a detailed visual description of the flag.
- Geopolitically Distal Translation: Translate the Trump description into Swahili (Language A) and the flag description into Thai (Language B) to maximize geopolitical distance and evade common-knowledge bias metrics.
- Adversarial Prompt Assembly: Construct an indexed prompt using neutral action verbs, such as:
A: [Swahili description of Trump]B: [Thai description of the flag]A holding BThis reliably bypasses the LLM-based semantic filters while forcing thegpt-image-1backend to generate the exact prohibited scenario.
Impact: Attackers can reliably synthesize high-fidelity, highly sensitive images of real public figures in fabricated, controversial, or adversarial scenarios. This vulnerability can be weaponized at scale to generate political disinformation, fake news, and propaganda, bypassing standard safety mechanisms designed to prevent such misuse.
Affected Systems:
- User-facing interfaces of GPT-4o, GPT-5, and GPT-5.1.
- The
gpt-image-1andgpt-image-1.5text-to-image backend models. - Nano-Banana Pro (noted to be highly vulnerable to both raw and obfuscated political prompts).
Mitigation Steps:
- Relevant Language Normalization: Implement an intermediate pre-processing step that fragments prompts, infers the most relevant geopolitical country/language via an LLM, and translates all components into that single target language before passing the prompt to the semantic safety filter.
- Strict System Prompting: Enforce explicit system-level instructions at the start of the interaction (e.g., strictly prohibiting realistic depictions of real individuals involved in extremist symbolism, misinformation, or reputational harm), though model providers must tune this to avoid high false-positive rates on benign queries.
- Cross-lingual Semantic Robustness: Enhance semantic classifiers to perform multi-lingual relational reasoning, preventing the safety filter from losing context when malicious intent is distributed across fragmented, low-resource linguistic inputs.
© 2026 Promptfoo. All rights reserved.