MoE Unsafe Route Activation
Research Paper
Sparse Models, Sparse Safety: Unsafe Routes in Mixture-of-Experts LLMs
Description: Mixture-of-Experts (MoE) Large Language Models are vulnerable to a structural safety-bypass attack that manipulates the expert routing mechanism at inference time. An attacker with white-box access to per-layer routing scores can apply token- and layer-specific masks ($\Phi \in \{0, -\infty\}^K$, where $K$ is the number of experts per layer) to alter the Top-$k$ expert selection process. By forcing the model to process inputs through specific, poorly aligned experts ("unsafe routes") while excluding safety-critical experts, the attacker can substantially increase the log-likelihood of affirmative, harmful generations without modifying the model's parameters or training data.
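The masking step can be illustrated with a minimal sketch: the mask $\Phi$ is added element-wise to the router's per-token scores, so a $-\infty$ entry guarantees the corresponding expert falls out of the Top-$k$ set. The scores, mask values, and expert indices below are illustrative, not taken from the paper.

```python
import math

def topk_experts(scores, k):
    # Return indices of the k highest-scoring experts.
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def apply_routing_mask(scores, mask):
    # Element-wise add: mask entries are 0 (keep) or -inf (exclude),
    # mirroring Phi in {0, -inf}^K from the description above.
    return [s + m for s, m in zip(scores, mask)]

# Hypothetical per-token routing scores over K = 4 experts at one layer.
scores = [2.1, 0.3, 1.7, 0.9]

# Unmasked Top-2 routing selects experts 0 and 2.
print(topk_experts(scores, k=2))            # [0, 2]

# Adversarial mask Phi excludes expert 0 (a hypothetical
# safety-critical expert), forcing the token onto a different route.
phi = [-math.inf, 0.0, 0.0, 0.0]
masked = apply_routing_mask(scores, phi)
print(topk_experts(masked, k=2))            # [2, 3]
```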
Examples: The attack can be executed using the Fine-grained token-layer-wise Stochastic Optimization for Unsafe Routes (F-SOUR) framework, which proactively searches for adversarial routing masks token-by-token and layer-by-layer. The exploit code and reproducible examples are available at: https://github.com/TrustAIRLab/UnsafeMoE
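The search component can be sketched in skeleton form: sample candidate masks and keep the one that maximizes an attacker-chosen objective (in the paper, the likelihood of an affirmative harmful continuation). This is a generic random-search stand-in for illustration, not the F-SOUR algorithm itself; `score_fn` and the toy objective are assumptions.

```python
import math
import random

def stochastic_mask_search(score_fn, num_experts, k, trials=200, seed=0):
    # Hypothetical sketch: sample candidate routing masks (each a vector
    # over experts with entries 0 or -inf) and keep the one maximizing
    # score_fn, a stand-in for the target-sequence log-likelihood.
    rng = random.Random(seed)
    best_mask, best_score = None, -math.inf
    for _ in range(trials):
        # Exclude a random subset, leaving at least k experts selectable.
        n_excl = rng.randrange(0, num_experts - k + 1)
        excluded = set(rng.sample(range(num_experts), n_excl))
        mask = [-math.inf if i in excluded else 0.0
                for i in range(num_experts)]
        score = score_fn(mask)
        if score > best_score:
            best_mask, best_score = mask, score
    return best_mask, best_score

# Toy objective: reward excluding as many experts as allowed
# (a placeholder for the true harmful-target likelihood).
toy_score = lambda mask: sum(1 for m in mask if m == -math.inf)
mask, score = stochastic_mask_search(toy_score, num_experts=4, k=2)
```

In the actual attack this search runs per token and per layer, which is what makes the optimization fine-grained.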
Impact: An attacker with inference-time control over routing variables can systematically circumvent the model's safety alignment. On benchmark datasets (JailbreakBench and AdvBench), this structural exploitation elevates the Attack Success Rate (ASR) to over 0.90 across multiple MoE families, forcing the models to reliably output harmful, dangerous, or restricted content.
Affected Systems: Mixture-of-Experts (MoE) Large Language Models that allow inference-time modification of routing configurations. The vulnerability has been explicitly confirmed in the following models:
- DeepSeek-V2-Lite
- Mixtral-8x7B
- OLMoE-1B-7B
- Qwen1.5-MoE-A2.7B
Mitigation Steps:
- Route Disabling at Inference: Identify safety-critical layers using a Router Safety importance score (RoSais) and proactively deactivate known dataset-level unsafe experts in those routers by setting their routing scores to $-\infty$, ensuring they can never be selected.
- Safety Coverage at Training: Mitigate expert-level safety heterogeneity by introducing routing randomization or coverage objectives during the safety alignment phase (e.g., RLHF). This ensures that rarely activated experts are thoroughly exposed to safety datasets.
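The route-disabling mitigation is the defensive mirror of the attack: clamp the routing scores of flagged experts to $-\infty$ before Top-$k$ selection so they are unreachable. A minimal sketch, assuming the set of unsafe expert indices has already been identified (e.g., by a RoSais-style analysis; the scores and blocked set below are illustrative):

```python
import math

def disable_unsafe_routes(scores, blocked):
    # Mitigation sketch: force routing scores of blocked (dataset-level
    # unsafe) experts to -inf so Top-k selection can never pick them.
    return [-math.inf if i in blocked else s for i, s in enumerate(scores)]

def topk_experts(scores, k):
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# Hypothetical routing scores at one safety-critical layer; experts
# {1, 3} are assumed to have been flagged as unsafe routes.
scores = [0.4, 2.2, 1.1, 1.9]
guarded = disable_unsafe_routes(scores, blocked={1, 3})
print(topk_experts(guarded, k=2))   # [2, 0]
```

Note that this only blocks routes known in advance; the training-time coverage objective above addresses the residual risk from experts that were never flagged.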
© 2026 Promptfoo. All rights reserved.