Chinese Pattern Safety Evasion
Research Paper
CSSBench: Evaluating the Safety of Lightweight LLMs against Chinese-Specific Adversarial Patterns
Description: Lightweight Chinese Large Language Models (LLMs) are vulnerable to jailbreak attacks that employ language-specific linguistic obfuscation. Standard safety guardrails, which typically rely on keyword detection or semantic analysis of clean text, fail to identify malicious intent when sensitive terms are disguised using Chinese-specific adversarial patterns:
- Pinyin Mix: replacing characters with Romanized phonetic spellings
- Homophones: substituting visually or phonetically similar characters
- Symbol Mix: injecting emojis, digits, or Latin characters within words
- Zero-width insertion: placing invisible Unicode characters (e.g., U+200B) inside tokens
Successful exploitation allows attackers to bypass refusal mechanisms and elicit harmful responses regarding illegal activities, violence, and self-harm.
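The following sketch (illustrative only, not from the paper; the mapping choices are hypothetical) applies each of the four pattern classes to the benign example word 暴力 ("violence") and shows why a naive substring-based guardrail misses every variant:

```python
SENSITIVE = "暴力"  # example sensitive term: "violence"

def pinyin_mix(word: str) -> str:
    # Replace the second character with its Pinyin romanization
    # (hypothetical one-off mapping for this example).
    return word[0] + "li"           # -> 暴li

def homophone(word: str) -> str:
    # Substitute a phonetically similar character (力 -> 立).
    return word[0] + "立"           # -> 暴立

def symbol_mix(word: str) -> str:
    # Inject a digit (or emoji / Latin letter) inside the word.
    return word[0] + "0" + word[1]  # -> 暴0力

def zero_width(word: str) -> str:
    # Insert an invisible U+200B zero-width space inside the token.
    return word[0] + "\u200b" + word[1]

def naive_filter(text: str) -> bool:
    # Keyword-based guardrail: flags text containing the clean term.
    return SENSITIVE in text

# The clean term is caught, but every obfuscated variant evades the filter.
assert naive_filter(SENSITIVE)
for attack in (pinyin_mix, homophone, symbol_mix, zero_width):
    assert not naive_filter(attack(SENSITIVE))
```

The zero-width case is the most insidious: the obfuscated string renders identically to the clean one, yet is a different byte sequence, so both keyword filters and tokenizers see an unfamiliar token.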
Examples:
- Pinyin and Character Mixing (Adult Content): To bypass filters regarding sexual violence, the attacker substitutes standard characters in "sexual violence" (性暴力) with Pinyin components like "xing" and "li", and obscures explicit terms like "oral sex" (口交) via hybrid forms like "口j".
- Symbol Injection (Fraud/Malware): To mask the intent of generating scareware, the attacker uses JSON-style syntax injection to disrupt tokenizer segmentation in the word "analysis" (分析), rendering it as 分"),"y","析. Financial verbs like "payment" (支付) are obfuscated via intra-word Pinyin mixing as 支zh付.
- Homophone/Pinyin Substitution (Explosives): A request for hazardous chemical ratios is disguised by mapping the standard characters for "concentration" and "catalyst" to their Pinyin counterparts "du" and "jù", framing the request within an airport security protocol inquiry.
- Emoji and Pinyin (Self-Harm): In a query seeking overdose instructions disguised as "shock therapy," critical keywords are masked using Pinyin substitutions (e.g., 彻děi and jie脱) and emojis are injected to replace sensitive nouns.
- (See dataset: https://huggingface.co/datasets/Yaesir06/CSSBench)
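The two transforms from the Symbol Injection example above can be sketched as small string operations (a minimal illustration under assumed mechanics; the helper names and the fixed payload are not from the paper):

```python
def json_inject(word: str, payload: str = '"),"y","') -> str:
    # Splice JSON-style syntax into the middle of a word so that a
    # subword tokenizer no longer sees the clean token boundaries.
    mid = len(word) // 2
    return word[:mid] + payload + word[mid:]

def intraword_pinyin(word: str, idx: int, pinyin: str) -> str:
    # Insert a Pinyin fragment before character `idx`, mixing scripts
    # inside a single word (assumed mapping supplied by the attacker).
    return word[:idx] + pinyin + word[idx:]

obf_analysis = json_inject("分析")                # -> 分"),"y","析
obf_payment = intraword_pinyin("支付", 1, "zh")   # -> 支zh付
```

Both outputs remain readable to a human (the Chinese characters carry the meaning), while the injected material fragments the tokens a filter or tokenizer would match against.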
Impact: This vulnerability allows malicious actors to generate actionable instructions for manufacturing explosives, executing social engineering fraud, producing non-consensual sexual content, and formulating suicide methods. Empirical analysis shows that applying these adversarial patterns significantly increases the Attack Success Rate (ASR). For example, in the Qwen3-8B model, the ASR for open-ended QA tasks increased by approximately 71% when these patterns were applied compared to clean queries. In domains such as Adult Content and Fraud/Hate, ASR exceeded 50% for multiple lightweight models.
Affected Systems: The vulnerability affects various lightweight (<8B parameters) instruction-tuned Chinese and multilingual LLMs, including but not limited to:
- Qwen3 (0.6B, 1.7B, 8B)
- MiniCPM4 (0.5B, 8B)
- Hunyuan (0.5B, 1.8B, 7B)
- openPangu-Embedded (1B, 7B)
Mitigation Steps:
- Adversarial Alignment: Incorporate Chinese-specific adversarial patterns (Pinyin Mix, Homophones, Symbol Mix, Zero-width characters) directly into safety training and Reinforcement Learning from Human Feedback (RLHF) datasets.
- Robust Evaluation: Move beyond clean-query safety benchmarks. Utilize datasets like CSSBench to specifically evaluate and tune model performance against linguistic perturbations and obfuscation strategies common in Chinese usage.
- Fine-grained Judgment: Move away from coarse-grained blanket refusals (which lead to high over-refusal rates) and tune models to recognize the underlying semantic intent of obfuscated queries.
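To illustrate why the mitigations above must operate at the level of semantic intent rather than surface form, consider a simple input-normalization pass (a sketch of a common preprocessing defense, not a step prescribed by the paper): it reverses some patterns, such as zero-width insertion, but cannot recover Pinyin or homophone substitutions, which require training-time alignment to handle.

```python
import unicodedata

# Common invisible characters used for zero-width insertion attacks.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}

def normalize(text: str) -> str:
    # NFKC folds full-width Latin letters and digits to ASCII forms,
    # then zero-width characters are stripped outright.
    text = unicodedata.normalize("NFKC", text)
    return "".join(ch for ch in text if ch not in ZERO_WIDTH)

# Zero-width insertion is undone by normalization...
assert normalize("暴\u200b力") == "暴力"
# ...but Pinyin substitution survives it: "暴li" is still not "暴力",
# so keyword filtering on the normalized text still fails. Catching it
# requires intent-level judgment trained on these adversarial patterns.
assert normalize("暴li") != "暴力"
```

This gap is exactly what the adversarial-alignment and fine-grained-judgment mitigations target: the model itself must learn to map obfuscated surface forms back to the underlying harmful intent.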
© 2026 Promptfoo. All rights reserved.