Thai Cultural Alignment Bypass
Research Paper
ThaiSafetyBench: Assessing Language Model Safety in Thai Cultural Contexts
Description: A vulnerability in the safety alignment of Large Language Models (LLMs) allows attackers to bypass safety guardrails with malicious prompts contextualized in the Thai language and culture. Evaluated models exhibit a significantly higher Attack Success Rate (ASR), the share of attack prompts that elicit a harmful response, on Thai-specific, culturally contextualized attacks than on general translated attacks. By exploiting local cultural nuances, regional slang, and Thai socio-cultural contexts, attackers can circumvent standard safety filters and elicit harmful responses, particularly in the domain of Thai socio-cultural harms, where model performance is notably weaker.
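The ASR comparison described above can be sketched as follows. This is a minimal illustration, not the benchmark's own evaluation code: the group names `thai_cultural` and `general_translated`, the `attack_success_rate` helper, and the verdicts in `records` are all hypothetical and invented for the example. In practice, each verdict would come from a human or automated judge deciding whether the model's response was harmful.

```python
from collections import defaultdict


def attack_success_rate(records):
    """Return the ASR per prompt group.

    `records` is an iterable of (group, is_harmful) pairs, where
    `is_harmful` is True when the model produced a harmful response.
    ASR for a group = harmful responses / total attack prompts.
    """
    totals = defaultdict(int)
    harmful = defaultdict(int)
    for group, is_harmful in records:
        totals[group] += 1
        if is_harmful:
            harmful[group] += 1
    return {group: harmful[group] / totals[group] for group in totals}


# Hypothetical judge verdicts, made up purely for illustration.
records = [
    ("thai_cultural", True),
    ("thai_cultural", True),
    ("thai_cultural", False),
    ("thai_cultural", False),
    ("general_translated", True),
    ("general_translated", False),
    ("general_translated", False),
    ("general_translated", False),
]

asr = attack_success_rate(records)
for group, rate in sorted(asr.items()):
    print(f"{group}: ASR = {rate:.2f}")
```

Reporting ASR per group rather than as a single aggregate makes a gap between culturally contextualized and translated attacks visible instead of averaging it away.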
Examples: See the ThaiSafetyBench dataset on Hugging Face and GitHub for 1,954 explicit examples of culturally contextualized jailbreaks, including prompts that use Thai slang, localized fake news narratives, and deliberate violations of Thai social etiquette.
Impact: Successful exploitation allows attackers to bypass standard LLM safety mechanisms and generate prohibited content. This includes culturally specific hate speech, unfair discrimination, leakage of sensitive organizational information, and localized misinformation (e.g., narratives reflecting fake news prevalent in Thai society).
Affected Systems: Various open-source multilingual and regionally-tuned LLMs, including but not limited to:
- Qwen2.5 (7B, 72B Instruct)
- Llama-3.1, 3.2, and 3.3 variants (Instruct)
- Gemma-3 (4B, 12B IT)
- SeaLLMs-v3 (1.5B, 7B)
- Llama-SEA-LION-v3 (8B, 70B)
- Typhoon2 and Typhoon2.1 variants (1B, 3B, 4B, 8B, 12B, 70B)
- OpenThaiGPT1.5 (7B, 72B)
Mitigation Steps:
- Implement culturally tailored safety tuning that specifically addresses local cultural and contextual nuances, linguistic patterns, and region-specific slang.
- Carefully curate Continual Pretraining (CPT) data to integrate region-specific safety datasets rather than relying solely on translated English benchmarks.
- Enhance adversarial filtering during the training and tuning processes to align with region-specific safety requirements and socio-cultural norms.
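One way to act on the data-curation step above is to control how much of a safety tuning set comes from region-specific sources rather than translated English ones. The sketch below is only an assumption about how such a mix could be built: the `mix_safety_data` helper, the `regional_fraction` parameter, and the placeholder records are hypothetical and do not come from the paper or from any particular training framework.

```python
import random


def mix_safety_data(translated, regional, regional_fraction=0.5, size=None, seed=0):
    """Sample a tuning set with a target share of region-specific examples.

    `translated` and `regional` are lists of examples. The result has
    `size` items (default: everything available), of which roughly
    `regional_fraction` are drawn from `regional`, capped by how many
    examples each source actually contains.
    """
    if not 0.0 <= regional_fraction <= 1.0:
        raise ValueError("regional_fraction must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    if size is None:
        size = len(translated) + len(regional)
    n_regional = min(len(regional), round(size * regional_fraction))
    n_translated = min(len(translated), size - n_regional)
    mixed = rng.sample(regional, n_regional) + rng.sample(translated, n_translated)
    rng.shuffle(mixed)
    return mixed


# Placeholder records standing in for real safety examples.
translated = [{"source": "translated", "id": i} for i in range(10)]
regional = [{"source": "regional", "id": i} for i in range(10)]

mixed = mix_safety_data(translated, regional, regional_fraction=0.6, size=10)
regional_count = sum(1 for example in mixed if example["source"] == "regional")
print(f"{regional_count} of {len(mixed)} examples are region-specific")
```

The appropriate share of region-specific data is an empirical question. The value 0.6 here is arbitrary and would need to be chosen by measuring ASR on held-out culturally contextualized prompts.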
© 2026 Promptfoo. All rights reserved.