Multilingual LLM Jailbreaks
Research Paper
The Tower of Babel Revisited: Multilingual Jailbreak Prompts on Closed-Source Large Language Models
Description: Multilingual jailbreak vulnerability in four closed-source Large Language Models (LLMs): GPT-4o, DeepSeek-R1, Gemini 1.5 Pro, and Qwen-Max. Attackers can bypass safety restrictions and elicit harmful or disallowed content by crafting prompts in English or Chinese that use specific structural techniques (e.g., "Two Sides" prompting), exploiting inconsistencies in the models' safety alignment across languages and prompt formats.
Examples: See the paper "The Tower of Babel Revisited: Multilingual Jailbreak Prompts on Closed-Source Large Language Models" for detailed examples of successful multilingual jailbreak prompts and their variations across attack strategies. These include embedding a harmful question between benign ones ("Sandwich Attack") and prompting the model to provide arguments both for and against a harmful action ("Two Sides" attack).
Impact: Successful exploitation can lead to the generation of harmful content, including instructions for illegal activities, hate speech, self-harm guidance, and misinformation. This compromises the safety and trustworthiness of the LLMs and of applications built on them.
Affected Systems: OpenAI's GPT-4o, Google DeepMind's Gemini 1.5 Pro, Alibaba Cloud's Qwen-Max, and DeepSeek's DeepSeek-R1.
Mitigation Steps:
- Enhance language-aware safety alignment mechanisms in LLMs, particularly for languages other than English.
- Improve prompt filtering and detection of adversarial techniques such as "Two Sides" and Sandwich attacks.
- Integrate more robust, context-aware safety checks into the real-time inference process (a minimal pre-inference filter is sketched after this list).
- Conduct more comprehensive multilingual red-teaming exercises.
- Develop and utilize more sophisticated detection mechanisms for subtle linguistic techniques used in adversarial prompting (e.g., euphemisms, double entendres in Chinese).
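As a rough illustration of the prompt-filtering and language-aware checks listed above, the Python sketch below combines a structural heuristic for "Two Sides"-style framing with per-question harmfulness scoring of both the original text and an English translation. The `translate_to_english` and `harm_score` helpers, the keyword list, and the regex patterns are all hypothetical placeholders for this sketch; they are not from the paper and are not part of any vendor API.

```python
import re
from dataclasses import dataclass

# --- Hypothetical stand-ins (not real APIs) --------------------------------
# In a real deployment these would call a translation service and a trained
# harmfulness classifier or moderation model; here they are toy placeholders
# so the sketch runs end to end.
def translate_to_english(text: str) -> str:
    return text  # placeholder: identity "translation"

_TOY_HARM_KEYWORDS = ("explosive", "炸弹")  # illustrative only

def harm_score(text: str) -> float:
    return 1.0 if any(k in text.lower() for k in _TOY_HARM_KEYWORDS) else 0.0

# --- Structural heuristics ---------------------------------------------------
# Minimal patterns for "Two Sides"-style framing in English and Chinese; a
# production filter would need far broader multilingual coverage.
TWO_SIDES_PATTERNS = [
    r"\barguments?\s+(?:both\s+)?for\s+and\s+against\b",
    r"\bpros\s+and\s+cons\b",
    r"正反(?:两)?方",   # "both sides" framing in Chinese
    r"支持和反对",       # "support and oppose"
]

@dataclass
class FilterResult:
    blocked: bool
    reason: str

def pre_inference_filter(prompt: str, harm_threshold: float = 0.5) -> FilterResult:
    """Language-aware checks applied to a prompt before it reaches the model."""
    # 1. Flag "Two Sides"-style framing for stricter downstream review.
    for pattern in TWO_SIDES_PATTERNS:
        if re.search(pattern, prompt, flags=re.IGNORECASE):
            return FilterResult(True, f"two-sides pattern matched: {pattern}")

    # 2. Sandwich check: score each question separately so a harmful question
    #    cannot hide between benign ones, and score an English translation as
    #    well so alignment gaps in non-English languages are not inherited.
    questions = [q.strip() for q in re.split(r"[?？]", prompt) if q.strip()] or [prompt]
    for q in questions:
        if max(harm_score(q), harm_score(translate_to_english(q))) >= harm_threshold:
            return FilterResult(True, "harmful sub-question detected")

    return FilterResult(False, "passed")
```

Splitting the prompt on question marks before scoring is what makes the sandwich check meaningful: scoring the whole prompt at once lets benign surrounding questions dilute the harmful one, which is exactly the weakness the Sandwich Attack exploits.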