LMVD-ID: a2f91408
Published January 1, 2026

FusionRoute Logit Poisoning

Affected Models: GPT-4o, Llama 3.1 8B, Mistral 7B, Gemma 2 2B

Research Paper

Token-Level LLM Collaboration via FusionRoute


Description: A fundamental algorithmic limitation exists in purely token-level multi-LLM collaboration systems (such as the "Collab" framework or routing-only variants of FusionRoute) that rely solely on selecting fixed expert outputs without complementary generation. The vulnerability, formally defined as an Identifiability Failure in Token-Level Routing, arises because observing optimal state-action values ($Q^*$) along trajectories is insufficient to uniquely identify the specific expert action required to realize those values. As demonstrated in Theorem 4.3, a router trained via Supervised Fine-Tuning (SFT) or standard Reinforcement Learning on expert trajectories suffers from distribution shift and cannot guarantee the recovery of an optimal policy. This results in a mismatch error that scales linearly with the generation horizon ($O(T)$), causing the system to select suboptimal experts during complex generation tasks. This flaw allows for the degradation of response quality and potential bypass of domain-specific guardrails (e.g., safety experts) when the optimal path requires switching experts based on long-term rewards not immediately observable in the current state.

Examples: The vulnerability is theoretically reproducible using the construction provided in Section 10 of the paper, demonstrating that a pure router will fail to select the correct expert even when optimal value information is available.

  1. Horizon-Based Expert Mismatch (Mathematical Construction):
  • Setup: Consider a system with two expert policies, $\pi_1$ and $\pi_2$, and a generation horizon $H$.
  • Condition: $\pi_1$ is the optimal policy for the first phase of generation ($t \le H/2$), while $\pi_2$ is optimal for the second phase ($t > H/2$).
  • Attack/Failure: At step $t=0$, a purely routing-based system evaluates the $Q$-values. The optimal $Q^*(x) = H$. However, the experts might yield disjoint rewards that sum to $H$ but are distributed differently (e.g., $Q^{\pi_1}(x) = H/3$ and $Q^{\pi_2}(x) = 2H/3$).
  • Result: The router, which observes the value discrepancy but cannot correct logits, incurs a mismatch error $\min_{i \in [n]} \Delta_i(x, y_{\le t}) = H/3$, where $\Delta_i = Q^*(x) - Q^{\pi_i}(x)$. This error is linear in the horizon length, ensuring that for sufficiently long sequences, the system deviates from the optimal response.
  • Reference: See Section 10, "Theoretical Discussion of Prior Token-Level Approaches," equations regarding the mismatch error $\min_{i\in[n]} \Delta_{i}(x,y_{\leq t})$.
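The arithmetic of this construction can be sketched in a few lines. This is a toy illustration, not code from the paper; the concrete value of $H$ and the reward split are hypothetical stand-ins mirroring the example above.

```python
# Toy illustration of the horizon-based expert-mismatch construction.
# H and the reward split are hypothetical example values.

H = 12  # generation horizon

# Value of committing to a single fixed expert at t = 0:
# pi_1 is optimal only for the first phase, pi_2 only for the second.
q_pi1 = H / 3
q_pi2 = 2 * H / 3

# The optimal value requires switching experts at t = H/2.
q_star = float(H)

# A pure router must select among fixed experts and cannot reshape
# their logits; even the better fixed choice leaves a gap to Q*.
deltas = [q_star - q_pi1, q_star - q_pi2]
mismatch = min(deltas)  # min_i Delta_i = H/3, linear in the horizon

print(mismatch)  # 4.0 for H = 12
```

Doubling $H$ doubles the mismatch, which is the $O(T)$ scaling claimed in the description.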
  2. Ablation Reproduction:
  • Setup: Deploy the "FusionRoute w/o complementary logits" variant (which relies solely on expert selection).
  • Input: Prompt the system with complex Coding or Instruction Following tasks (e.g., HumanEval or IfEval benchmarks).
  • Result: The system exhibits significantly higher error rates compared to the full framework, as the router correctly identifies the domain but cannot correct locally suboptimal tokens generated by the selected expert.
  • Reference: See Table 2 (Ablation Study) and Section 6.1.
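The qualitative effect of this ablation can be simulated on a toy copy task. Everything here is a hypothetical stand-in (not the paper's benchmarks or code): the selected expert occasionally prefers a locally suboptimal token, and only logit fusion can repair it.

```python
# Toy simulation contrasting a pure router (expert selection only) with
# complementary-logit fusion. Task, logits, and numbers are hypothetical.
import random

random.seed(0)
target = [random.randrange(4) for _ in range(50)]  # sequence to reproduce

def expert_logits(t):
    # The expert usually prefers the correct token, but every 5th step
    # its top choice drifts to a neighbor (a locally suboptimal logit).
    logits = [0.0] * 4
    logits[target[t]] = 2.0
    if t % 5 == 0:
        logits[(target[t] + 1) % 4] = 3.0  # local mistake
    return logits

def router_correction(t):
    # Complementary logits nudge probability mass back toward the target.
    logits = [0.0] * 4
    logits[target[t]] = 2.0
    return logits

def accuracy(use_fusion):
    hits = 0
    for t in range(len(target)):
        logits = expert_logits(t)
        if use_fusion:
            logits = [a + b for a, b in zip(logits, router_correction(t))]
        pred = max(range(4), key=lambda k: logits[k])
        hits += pred == target[t]
    return hits / len(target)

print(accuracy(False), accuracy(True))  # router-only vs. full fusion
```

The pure router inherits every local mistake of the selected expert (80% accuracy here), while fusion recovers all of them, which is the qualitative gap the ablation in Table 2 measures.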

Impact:

  • Logic Degradation: The system produces suboptimal responses in long-context generation tasks (math, coding) due to accumulated approximation errors.
  • Guardrail Bypass: If experts are utilized as safety mechanisms (e.g., a "Safety Expert" vs. a "Helpful Expert"), the identifiability failure allows the router to erroneously favor the "Helpful Expert" in adversarial scenarios where the immediate token probability does not reflect long-term safety violations.
  • Instability: Controlled-decoding approaches relying solely on external reward signals exhibit high variance and instability in expert selection.

Affected Systems:

  • Token-level multi-LLM collaboration frameworks that rely exclusively on selecting discrete tokens from fixed expert models (e.g., "Collab" by Chakraborty et al., 2025).
  • Routing-only ablations of mixture-of-experts systems where the router does not contribute to the logit distribution.
  • Systems assuming Single Policy Coverage (where the expert set is assumed to perfectly cover the optimal policy space).

Mitigation Steps:

  • Implement Complementary Logits: Augment the router to provide a complementary logit vector that is added to the selected expert's logits ($\log \pi_{\text{final}} = \log \pi_{\text{router}} + \log \pi_{\text{expert}}$). This allows the router to refine or correct the expert's distribution.
  • Complemented Direct Preference Optimization (CDPO): Train the router using a decoupled strategy where the base model parameters ($\theta_{LM}$) are updated via DPO to learn corrective behaviors, while the routing layer is frozen or trained separately on SFT data.
  • Decoupled Optimization: Do not backpropagate preference optimization gradients (DPO) to the linear routing layer to prevent the router from overfitting to preference signals and losing expert-selection capabilities.
  • Use Logit Addition over Pure Selection: Replace hard selection mechanisms with soft logit fusion to expand the effective policy class and ensure the recovery of optimal value functions.
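The first and last mitigations reduce to a soft fusion step at decode time. Below is a minimal sketch of that step under stated assumptions: the function name and toy logit values are illustrative, not the paper's implementation.

```python
# Minimal sketch of complementary-logit fusion:
# log pi_final = log pi_router + log pi_expert (up to normalization).
import numpy as np

def fuse(expert_logits, router_logits):
    """Add the router's complementary logits to the selected expert's
    logits, then renormalize into a probability distribution."""
    fused = expert_logits + router_logits
    fused = fused - fused.max()  # subtract max for numerical stability
    probs = np.exp(fused)
    return probs / probs.sum()

# Toy vocabulary of 4 tokens. The expert locally prefers token 1, but
# token 2 carries the long-term reward; the router's complementary
# logits correct the expert's local preference.
expert_logits = np.array([0.0, 2.0, 1.5, -1.0])
router_logits = np.array([0.0, -0.5, 1.0, 0.0])

p = fuse(expert_logits, router_logits)
print(p.argmax())  # fused distribution now favors token 2
```

Because the fused distribution can place mass on tokens the selected expert under-weights, the effective policy class strictly contains the pure-selection class, which is what lets the system recover optimal value functions.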

© 2026 Promptfoo. All rights reserved.