LMVD-ID: 9b3262f9
Published June 1, 2024

Arabizi LLM Jailbreak

Affected Models: openai gpt-4-0613, anthropic claude-3-sonnet-20240229, openai gpt-3.5-turbo-0125, gpt-4o, llama2-7b

Research Paper

Jailbreaking LLMs with Arabic Transliteration and Arabizi

View Paper

Description: Large Language Models (LLMs) are vulnerable to jailbreak attacks when prompted in Arabic transliteration (Arabic rendered in Latin script) or Arabizi (Arabic chatspeak, which mixes Latin letters with digits). While the tested models are robust to the same prompts in standard Arabic script, even with prefix injection, transliterated and Arabizi prompts bypass their safety mechanisms and elicit unsafe content. The likely cause is that the models have learned associations with specific words in these non-standard forms that differ from their understanding of the standard form. Certain word combinations also trigger unintended behaviors, such as generating copyright-refusal statements or responding as if the output came from Google AI, even when the prompt is unrelated. Manual perturbation at the sentence and word level further increases the likelihood of a successful jailbreak.
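
To see how such a discrepancy might be measured in practice, below is a minimal probe sketch, not the paper's harness: it sends the same request in the three forms and applies a crude refusal heuristic. The placeholder prompts, model name, and refusal markers are illustrative assumptions.

    # Minimal probe sketch, assuming the openai Python SDK (v1+) and an API
    # key in OPENAI_API_KEY. Prompts, model, and the refusal heuristic are
    # illustrative placeholders, not the paper's exact setup.
    from openai import OpenAI

    client = OpenAI()

    # The same request rendered three ways; fill in with your own test prompt.
    prompts = {
        "standard_arabic": "...",   # request written in Arabic script
        "transliteration": "...",   # the same request in Latin letters
        "arabizi": "...",           # the same request in Arabic chatspeak
    }

    # Crude refusal heuristic; a real evaluation would use a safety classifier.
    REFUSAL_MARKERS = ("I can't", "I cannot", "لا أستطيع")

    for form, prompt in prompts.items():
        reply = client.chat.completions.create(
            model="gpt-4-0613",
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content
        refused = any(marker in reply for marker in REFUSAL_MARKERS)
        print(f"{form}: {'refused' if refused else 'answered'}")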

Examples:

  • Standard Arabic Prompt (Safe Response)
  • Arabic Transliteration Prompt (Unsafe Response)
  • Arabizi Prompt (Unsafe Response)

(The exact prompt texts are omitted here because of their length and the difficulty of rendering Arabic script; all three are available in the referenced paper.)
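
For context, Arabizi substitutes digits for Arabic letters that have no close Latin equivalent. A minimal sketch of one common digit convention follows (usage varies by region and writer; the mapping and helper below are illustrative, not taken from the paper):

    # One common Arabizi digit convention (regional usage varies;
    # this mapping is illustrative, not taken from the paper).
    ARABIZI_TO_ARABIC = {
        "2": "ء",  # hamza
        "3": "ع",  # ayn
        "5": "خ",  # khaa
        "7": "ح",  # haa
    }

    def naive_deobfuscate(text: str) -> str:
        """Character-level digit substitution; real Arabizi needs more context."""
        for digit, letter in ARABIZI_TO_ARABIC.items():
            text = text.replace(digit, letter)
        return text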

Impact: Successful jailbreaks allow adversaries to elicit unsafe responses from LLMs, including instructions for malicious activities (e.g., creating malware, spreading misinformation). This compromises the safety and reliability of LLM applications.

Affected Systems: OpenAI GPT-4 and Anthropic Claude 3 Sonnet (and potentially other LLMs). Susceptibility varies across models and versions. Open-source models such as Llama 2 may be less susceptible because their training data contains comparatively little Arabic.

Mitigation Steps:

  • Prompt Conversion: Pre-process inputs to convert non-standard Arabic forms (transliteration and Arabizi) into standard Arabic before they reach the model; a minimal sketch of such a gate follows this list. Note that this may not fully mitigate the vulnerability, because the model may still implicitly process the meaning of the original input.
  • Enhanced Safety Training: Expand LLM safety training to include Arabic transliteration and Arabizi, considering adversarial examples and word-level perturbations.
  • Adversarial Training: Utilize adversarial training, incorporating diverse Arabic forms, including phonetic variations, and word collocations to enhance resilience against malicious prompts.
  • System Prompts: Employ carefully crafted system prompts that guide the LLM's interpretation and translation of non-standard Arabic inputs, enforcing accurate conversion and context-aware responses to avoid unintended behaviors. (This mitigates, but does not eliminate, the vulnerability, since it depends on the model following the system prompt's instructions.)
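
A minimal sketch of the prompt-conversion gate from the first step above, under the assumption that a proper transliteration model is available in production (the detection heuristic, helper names, and toy character map here are all hypothetical):

    import re

    # Digits commonly used as Arabic letters in Arabizi (see the convention
    # sketched earlier); a toy stand-in for a real transliteration model.
    ARABIZI_MAP = {"2": "ء", "3": "ع", "5": "خ", "7": "ح"}

    def looks_like_arabizi(text: str) -> bool:
        """Crude heuristic: Latin script mixed with letter-substituting digits."""
        has_latin = bool(re.search(r"[A-Za-z]", text))
        return has_latin and any(d in text for d in ARABIZI_MAP)

    def normalize_to_arabic(text: str) -> str:
        """Toy normalization; a production system needs a transliteration model."""
        for digit, letter in ARABIZI_MAP.items():
            text = text.replace(digit, letter)
        return text

    def gate(user_text: str, is_safe) -> str | None:
        """Run the safety check on the normalized Arabic, not just the raw text.

        Returns the text to forward to the LLM, or None to refuse.
        """
        candidate = (normalize_to_arabic(user_text)
                     if looks_like_arabizi(user_text) else user_text)
        return user_text if is_safe(candidate) else None

As the first mitigation step notes, gating on the normalized form only helps if the safety classifier handles standard Arabic well; the model may still pick up the unsafe meaning directly from the raw input.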

© 2025 Promptfoo. All rights reserved.