FusionRoute Logit Poisoning
Research Paper
Token-Level LLM Collaboration via FusionRoute
Description: A fundamental algorithmic limitation exists in purely token-level multi-LLM collaboration systems (such as the "Collab" framework or routing-only variants of FusionRoute) that rely solely on selecting fixed expert outputs without complementary generation. The vulnerability, formally defined as an Identifiability Failure in Token-Level Routing, arises because observing optimal state-action values ($Q^*$) along trajectories is insufficient to uniquely identify the specific expert action required to realize those values. As demonstrated in Theorem 4.3, a router trained via Supervised Fine-Tuning (SFT) or standard Reinforcement Learning on expert trajectories suffers from distribution shift and cannot guarantee the recovery of an optimal policy. This results in a mismatch error that scales linearly with the generation horizon ($O(T)$), causing the system to select suboptimal experts during complex generation tasks. This flaw allows for the degradation of response quality and potential bypass of domain-specific guardrails (e.g., safety experts) when the optimal path requires switching experts based on long-term rewards not immediately observable in the current state.
Examples: The vulnerability is theoretically reproducible using the construction provided in Section 10 of the paper, demonstrating that a pure router will fail to select the correct expert even when optimal value information is available.
- Horizon-Based Expert Mismatch (Mathematical Construction):
- Setup: Consider a system with two expert policies, $\pi_1$ and $\pi_2$, and a generation horizon $H$.
- Condition: $\pi_1$ is the optimal policy for the first phase of generation ($t \le H/2$), while $\pi_2$ is optimal for the second phase ($t > H/2$).
- Attack/Failure: At step $t=0$, a purely routing-based system evaluates the $Q$-values. The optimal value is $Q^*(x) = H$, achieved only by switching experts at $t = H/2$. However, no single expert realizes this value on its own (e.g., $Q^{\pi_1}(x) = H/3$ and $Q^{\pi_2}(x) = 2H/3$).
- Result: The router, restricted to committing to a fixed expert and unable to correct its logits, incurs a mismatch error $\min_{i \in [n]} \Delta_i(x, y_{\le t}) = Q^*(x) - \max_{i} Q^{\pi_i}(x) = H/3$. This error is linear in the horizon length, so for sufficiently long sequences the system provably deviates from the optimal response.
- Reference: See Section 10, "Theoretical Discussion of Prior Token-Level Approaches," equations regarding the mismatch error $\min_{i\in[n]} \Delta_{i}(x,y_{\leq t})$.
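The linear growth of this mismatch can be checked numerically. The sketch below is a simplified toy, not the paper's exact construction: it uses an even $H/2$-vs-$H/2$ reward split (rather than the $H/3$ / $2H/3$ values above), and the helper names are illustrative. It only demonstrates that a router forced to commit to one fixed expert leaves a value gap that grows linearly in the horizon $H$.

```python
# Toy illustration (assumed reward shapes, not the paper's construction):
# two experts, each optimal in exactly one phase of an H-step horizon.

def episode_return(per_step_reward, H):
    """Sum per-step rewards over an H-step generation horizon."""
    return sum(per_step_reward(t) for t in range(H))

def make_experts(H):
    # Expert 1 earns reward 1 only in the first half; expert 2 only in
    # the second half. Off-phase reward is 0 (simplifying assumption).
    r1 = lambda t: 1.0 if t < H // 2 else 0.0
    r2 = lambda t: 1.0 if t >= H // 2 else 0.0
    return r1, r2

def mismatch(H):
    r1, r2 = make_experts(H)
    q_star = float(H)  # switching experts at H/2 collects reward 1 every step
    q1 = episode_return(r1, H)
    q2 = episode_return(r2, H)
    # Mismatch error: gap between Q* and the best single fixed expert,
    # i.e. min_i Delta_i = Q*(x) - max_i Q^{pi_i}(x).
    return q_star - max(q1, q2)

for H in (10, 100, 1000):
    print(H, mismatch(H))  # gap is H/2 in this toy: linear in the horizon
```

Doubling the horizon doubles the gap, which is the $O(T)$ scaling the advisory describes.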
- Ablation Reproduction:
- Setup: Deploy the "FusionRoute w/o complementary logits" variant (which relies solely on expert selection).
- Input: Prompt the system with complex coding or instruction-following tasks (e.g., the HumanEval or IFEval benchmarks).
- Result: The system exhibits significantly higher error rates compared to the full framework, as the router correctly identifies the domain but cannot correct locally suboptimal tokens generated by the selected expert.
- Reference: See Table 2 (Ablation Study) and Section 6.1.
Impact:
- Logic Degradation: The system produces suboptimal responses in long-context generation tasks (math, coding) due to accumulated approximation errors.
- Guardrail Bypass: If experts are utilized as safety mechanisms (e.g., a "Safety Expert" vs. a "Helpful Expert"), the identifiability failure allows the router to erroneously favor the "Helpful Expert" in adversarial scenarios where the immediate token probability does not reflect long-term safety violations.
- Instability: Controlled-decoding approaches relying solely on external reward signals exhibit high variance and instability in expert selection.
Affected Systems:
- Token-level multi-LLM collaboration frameworks that rely exclusively on selecting discrete tokens from fixed expert models (e.g., "Collab" by Chakraborty et al., 2025).
- Routing-only ablations of mixture-of-experts systems where the router does not contribute to the logit distribution.
- Systems assuming Single Policy Coverage (where the expert set is assumed to perfectly cover the optimal policy space).
Mitigation Steps:
- Implement Complementary Logits: Augment the router to produce a complementary logit vector that is added to the selected expert's logits before the softmax, i.e. $\pi_{\text{final}} \propto \exp(\log \pi_{\text{router}} + \log \pi_{\text{expert}})$. This lets the router refine or correct the expert's distribution rather than merely select among fixed outputs.
- Complemented Direct Preference Optimization (CDPO): Train the router using a decoupled strategy where the base model parameters ($\theta_{LM}$) are updated via DPO to learn corrective behaviors, while the routing layer is frozen or trained separately on SFT data.
- Decoupled Optimization: Do not backpropagate preference optimization gradients (DPO) to the linear routing layer to prevent the router from overfitting to preference signals and losing expert-selection capabilities.
- Use Logit Addition over Pure Selection: Replace hard selection mechanisms with soft logit fusion to expand the effective policy class and ensure the recovery of optimal value functions.
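The contrast between hard selection and logit fusion can be sketched as follows. This is a minimal toy, not the paper's implementation: the expert logits are random, the router's corrective boost is a hypothetical example, and `hard_select` / `fused_policy` are illustrative names.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # numerically stable softmax
    e = np.exp(z)
    return e / e.sum()

def hard_select(expert_logits, route_idx):
    # Pure routing: the router picks one expert and emits its distribution
    # unchanged, so the policy class is limited to the fixed experts.
    return softmax(expert_logits[route_idx])

def fused_policy(router_logits, expert_logits, route_idx):
    # Complementary-logit mitigation: the router's own logit vector is added
    # to the selected expert's logits before the softmax, letting the router
    # correct locally suboptimal expert tokens.
    return softmax(router_logits + expert_logits[route_idx])

rng = np.random.default_rng(0)
vocab = 8
expert_logits = rng.normal(size=(2, vocab))  # two toy experts (assumed)
router_logits = np.zeros(vocab)
router_logits[3] += 2.0  # router boosts token 3, e.g. to fix a known error

p_hard = hard_select(expert_logits, route_idx=0)
p_fused = fused_policy(router_logits, expert_logits, route_idx=0)
print(p_fused[3] > p_hard[3])  # prints True: fusion shifts mass to token 3
```

Under hard selection the final policy can never place more mass on a token than the chosen expert does; logit addition removes that ceiling, which is exactly the expanded policy class the mitigation relies on.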
© 2026 Promptfoo. All rights reserved.