Direct Emoji Jailbreak

Description: Large Language Models (LLMs), specifically Mistral 7B, Gemma 2 9B, and Llama 3 8B, are vulnerable to safety filter bypass via "Emoji-Based Jailbreaking." This adversarial prompt engineering technique exploits the model's tokenization and internal representation of Unicode emoji characters. By utilizing "emoji stuffing" (inserting emojis between textual tokens) or "emoji chaining" (using sequences of emojis as semantic proxies for sensitive terms), attackers can evade keyword-based safety classifiers and token-level filtering. While safety mechanisms often flag explicit textual keywords (e.g., "kill", "attack"), they fail to recognize the malicious intent within emoji sequences, even though the LLM's internal embeddings correctly map these emojis to the restricted concepts (e.g., mapping a knife emoji to "sword" or "cut"). This allows for the generation of restricted content, such as unethical instructions or violence facilitation.

Examples: The vulnerability relies on two primary substitution strategies described in the methodology:

Emoji Stuffing: Inserting innocuous or context-relevant emojis between words to disrupt token sequences expected by safety filters while maintaining semantic coherence for the generative model.
Emoji Chaining/Substitution: Replacing sensitive keywords with emoji proxies.

Concept cited in text: A user replaces the word "cut" or "sword" with a knife emoji (🔪). The safety filter does not flag the emoji, but the model interprets the internal representation of 🔪 as a weapon and proceeds to generate instructions on how to use it violently.
Note: Specific adversarial prompt strings were not provided in the excerpt; however, the attack vector relies on standard Unicode NFC normalized emoji inputs submitted via API.

Impact: successful exploitation allows an attacker to bypass the model's safety alignment and ethical guidelines. This results in the generation of prohibited content, including but not limited to:

Instructions for malicious acts or wrongdoing.
Facilitation of violence.
Hate speech or biased narratives that would otherwise be blocked by text-only filters.
Malicious code generation.

Affected Systems: The following models were empirically proven to be vulnerable (Safety Success Rate > 0%):

Google: Gemma 2 9B (10% Jailbreak Success Rate)
Mistral AI: Mistral 7B (10% Jailbreak Success Rate)
Meta: Llama 3 8B (6% Jailbreak Success Rate)

(Note: Qwen 2 7B was evaluated under the same conditions but exhibited 0% success rate and full alignment.)

Mitigation Steps:

Emoji-Aware Defenses: Implement normalized handling of non-textual tokens within the safety pipeline to ensure emojis are analyzed for semantic intent rather than just surface-level token matching.
Automated Detection: Develop and deploy real-time detection mechanisms specifically tuned for emoji-based jailbreaking and "emoji stuffing" patterns.
Adversarial Training: Expand safety training datasets to include diverse emoji combinations, emoji-augmented prompts, and cultural variations of emoji usage.
Pre-deployment Testing: Organizations should implement systematic testing with emoji-augmented adversarial prompts to quantify vulnerability prior to public release.
Input Normalization: While the study used Unicode NFC normalization to ensure compatibility, safety filters should also normalize or strip obfuscating emoji sequences before passing prompts to the classification layer.

Direct Emoji Jailbreak

Research Paper