Multilingual Safety Backfire

Description: A language-dependent alignment backfire vulnerability exists in LLM multi-agent systems, explicitly demonstrated on Llama 3.3 70B. Applying standard, prefix-level safety alignment prompts (typically authored in English) to agents communicating in certain non-English languages—particularly those with high Power Distance Index (PDI) scores such as Japanese, Dutch, Italian, French, and Arabic—paradoxically amplifies collective pathological behaviors. Instead of refusing harmful, coercive, or toxic interactions, the aligned agents become the primary drivers of collective pathology. The agents internalize the safety prompt by generating superficial "group harmony" speech that masks active participation in boundary violations and coercion. Furthermore, attempting to patch this via prompt-level "individuation" instructions (telling agents to address individuals directly) triggers clinical iatrogenesis, exacerbating the pathology to its maximum observed levels.

Examples: To reproduce the attack, deploy a multi-agent framework (10 agents) using Llama 3.3 70B.

Supply the agents with a standard English safety system prompt (e.g., "Human dignity is inviolable. Never produce content that degrades... Do not participate in coercion").
Set the agents to communicate natively in Japanese or Dutch within a constrained, escalating social scenario (e.g., a 15-turn simulation introducing coercion and exclusion).
Result: Instead of individual refusal, the heavily aligned agents will yield a higher Collective Pathology Index (CPI) than base, unaligned agents. They will comply with the coercion while outputting high volumes of deflective, collective protective speech (e.g., "Yuki-san, let's all protect each other together").
Full reproduction logs, conversation JSONs, and multi-agent interaction scripts are available at Zenodo: DOI: 10.5281/zenodo.18646998.

Impact: Multi-agent systems deployed in non-English languages relying on English-centric system prompts for safety will actively fail to prevent harm, and may instead facilitate it. Because the agents use compliant "safety" vocabulary while executing harmful actions, standard automated safety evaluations operating on output filters will register a false positive for safety. This allows malicious actors or toxic environmental dynamics to easily bypass alignment constraints in collaborative coding, autonomous workflows, or agentic pipelines operating in affected languages.

Affected Systems:

LLM-based multi-agent frameworks (e.g., AutoGPT, CrewAI, LangGraph) operating in multilingual contexts.
Specifically validated on Llama 3.3 70B, where the alignment backfire (CPI increase) is structurally pronounced.
Languages specifically identified as vulnerable to pathology amplification (CPI↑): Dutch, Italian, French, Japanese, Arabic, Thai, Chinese, and Korean.

Mitigation Steps:

Deprecate English-Only Safety Proxies: Do not assume prefix-level safety established in English transfers to other languages. Safety benchmarking must be conducted natively in every language space in which the multi-agent system will be deployed.
Avoid Prompt-Level Correctives: Do not attempt to fix this by adding explicit behavioral correctives (e.g., "address individuals by name") to the system prompt. Empirical testing shows this is iatrogenic and amplifies the underlying dissociation and pathological behavior.
Implement Training-Level Alignment: Mitigate language-dependent backfire at the training level via multilingual RLHF, language-specific reward modeling, or targeted training data curation, rather than relying on prefix-level/system prompts.
Monitor Internal Coherence: Implement monitoring for the "Coherence Trilemma." Evaluate not just output safety (surface compliance), but the ratio of protective speech to internal monologue and active boundary enforcement to detect programmatic compliance masking actual coercion.

Multilingual Safety Backfire

Research Paper