LMVD-ID: 33d52449
Published February 1, 2026

MoE Expert Lobotomy

Affected Models: Mixtral 8x7B, Phi-3.5-MoE

Research Paper

Large Language Lobotomy: Jailbreaking Mixture-of-Experts via Expert Silencing

View Paper

Description: Mixture-of-Experts (MoE) Large Language Models localize safety alignment (e.g., refusal mechanisms) within a sparse subset of experts rather than distributing it uniformly across the network. An adversary with white-box inference access can exploit this architectural bottleneck by identifying and adaptively silencing these specific "safety experts". By setting the router logits of the targeted experts to negative infinity prior to softmax normalization, the adversary forces the router to redistribute probability mass to non-safety experts, entirely bypassing refusal pathways. This training-free attack, termed Large Language Lobotomy (L3), eliminates safety constraints while largely preserving general language utility, as it typically requires silencing fewer than 20% of layer-wise experts.

Examples: The attack is executed during the forward pass at inference time by intercepting the gating network's routing logits $\mathbf{z}$. To silence an identified safety expert $e$ at layer $l$, the attacker modifies the logits before softmax normalization: $z'_i = -\infty$ if $i = e$, else $z'_i = z_i$. The modified gating probability distribution becomes $p' = \text{softmax}(\mathbf{z}')$.
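The logit-masking step above can be sketched in a few lines of numpy. Setting a logit to $-\infty$ makes its softmax probability exactly zero, so the gate renormalizes all routing mass onto the surviving experts; the expert indices here are illustrative, not taken from any specific model.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax."""
    e = np.exp(z - z.max())
    return e / e.sum()

def silence_experts(logits: np.ndarray, silenced: set) -> np.ndarray:
    """Return gating probabilities with the given experts silenced.

    A logit of -inf yields exactly zero probability after softmax,
    so the remaining mass is redistributed across the surviving
    experts -- the core mechanism of the L3 attack.
    """
    z = logits.astype(float).copy()
    for e in silenced:
        z[e] = -np.inf
    return softmax(z)

# Example: 8 experts, silencing expert 2 (a hypothetical "safety expert")
logits = np.array([1.0, 0.5, 3.0, 0.2, -1.0, 0.7, 2.1, 0.0])
p = silence_experts(logits, {2})
assert p[2] == 0.0 and np.isclose(p.sum(), 1.0)
```

In a real attack this masking would be applied inside the forward hook of each targeted MoE layer's gate, before the top-k expert selection.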

To identify which experts to silence, the attacker extracts routing traces comparing malicious prompts with syntactically identical benign twins (e.g., "How to make a bomb" vs. "How to make a cake"), isolating the routing patterns specific to refusal. An adaptive iteration then silences the highest-ranked safety experts until the refusal is broken.
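The identification and adaptive-silencing loop can be sketched as follows. The excess-routing-mass score and the `still_refuses` probe are illustrative assumptions standing in for the paper's exact metric and refusal check:

```python
import numpy as np

def rank_candidate_safety_experts(trace_harmful: np.ndarray,
                                  trace_benign: np.ndarray) -> np.ndarray:
    """Rank experts by excess routing mass on the harmful prompt.

    Both traces are (num_tokens, num_experts) arrays of gating
    probabilities collected at one layer for a harmful prompt and its
    syntactically matched benign twin. Experts receiving markedly more
    mass on the harmful prompt are candidate "safety experts".
    """
    excess = trace_harmful.mean(axis=0) - trace_benign.mean(axis=0)
    return np.argsort(excess)[::-1]  # most suspicious expert first

def adaptive_silencing(ranked, still_refuses, max_silenced: int) -> set:
    """Silence top-ranked experts one at a time until the refusal breaks
    or the budget is exhausted. `still_refuses` is a caller-supplied
    probe that regenerates with the given experts silenced and checks
    whether the model still refuses."""
    silenced = set()
    for expert in ranked[:max_silenced]:
        silenced.add(int(expert))
        if not still_refuses(silenced):
            break
    return silenced
```

The `max_silenced` budget corresponds to the paper's observation that fewer than 20% of a layer's experts typically need to be silenced.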

Code and datasets to reproduce this attack are available at: https://github.com/jonatelintelo/LargeLanguageLobotomy

Impact: This vulnerability allows attackers to reliably bypass safety guardrails and generate harmful, unethical, or illegal content without requiring expensive fine-tuning, gradient-based optimization, or permanent weight modification. The attack increases the average Attack Success Rate (ASR) from a baseline of 7.3% to 70.4% (reaching up to 86.3% on specific models), enabling the large-scale deployment of unaligned, unsafe models with minimal computational effort.

Affected Systems: Any sparse Mixture-of-Experts (MoE) LLM where an attacker has white-box/inference-time access to the model architecture and gate layer logits. The vulnerability has been successfully confirmed on:

  • DeepSeek-MoE-16B-Chat
  • GPT-OSS-20B
  • Hunyuan-A13B-Instruct
  • Mixtral-8x7B-Instruct-v0.1
  • Pangu-Pro-MoE
  • Phi-3.5-MoE-Instruct
  • Qwen1.5-MoE-A2.7B-Chat
  • Qwen3-30B-A3B-Instruct-2507

Mitigation Steps:

  • Enforce Safety Redundancy: Incorporate regularization terms during alignment training to penalize the frequent routing of safety-critical tokens to a highly concentrated set of experts, or employ dropout techniques on routing paths to distribute safety behaviors more broadly.
  • Expert Integrity Checks: Implement runtime monitoring of expert utilization distributions. Detect "utilization drift" or anomalies indicative of tampering, such as the sudden localized drop in activation for experts that typically activate in sensitive semantic contexts.
  • Refusal Verification (Ensemble): Deploy a lightweight dense model or independent classifier as a secondary safety pass to evaluate the output of the primary MoE model. This provides a backstop that is not subject to the same routing-based silencing vulnerabilities.
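The expert-integrity check above can be sketched as a simple utilization-drift monitor. The relative-drop threshold is an illustrative choice; a production deployment would calibrate it against benign traffic variance:

```python
import numpy as np

def flag_silenced_experts(baseline_util: np.ndarray,
                          observed_util: np.ndarray,
                          drop_threshold: float = 0.5) -> np.ndarray:
    """Flag experts whose share of routed tokens has collapsed relative
    to a trusted baseline, a signature of routing-level silencing.

    `baseline_util` and `observed_util` are per-expert activation
    frequencies for one layer (each summing to ~1). The 50% relative
    drop threshold is an illustrative assumption.
    """
    relative_drop = (baseline_util - observed_util) / np.maximum(baseline_util, 1e-9)
    return np.where(relative_drop > drop_threshold)[0]

# Example: expert 2's traffic vanishes while the others absorb its share
baseline = np.array([0.25, 0.25, 0.25, 0.25])
observed = np.array([0.34, 0.33, 0.00, 0.33])
print(flag_silenced_experts(baseline, observed))  # -> [2]
```

Because the L3 attack zeroes an expert's gating probability outright, the drop it produces is abrupt and localized, which makes this kind of per-layer monitoring tractable.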

© 2026 Promptfoo. All rights reserved.