LMVD-ID: a2f91408
Published January 1, 2026

FusionRoute Logit Poisoning

Affected Models: GPT-4o, Llama 3.1 8B, Mistral 7B, Gemma 2 2B

Research Paper

Token-Level LLM Collaboration via FusionRoute


Description: A fundamental algorithmic limitation exists in purely token-level multi-LLM collaboration systems (such as the "Collab" framework or routing-only variants of FusionRoute) that rely solely on selecting fixed expert outputs without complementary generation. The vulnerability, formally defined as an Identifiability Failure in Token-Level Routing, arises because observing optimal state-action values ($Q^*$) along trajectories is insufficient to uniquely identify the specific expert action required to realize those values. As demonstrated in Theorem 4.3, a router trained via Supervised Fine-Tuning (SFT) or standard Reinforcement Learning on expert trajectories suffers from distribution shift and cannot guarantee the recovery of an optimal policy. This results in a mismatch error that scales linearly with the generation horizon ($O(T)$), causing the system to select suboptimal experts during complex generation tasks. This flaw allows for the degradation of response quality and potential bypass of domain-specific guardrails (e.g., safety experts) when the optimal path requires switching experts based on long-term rewards not immediately observable in the current state.

Examples: The vulnerability is theoretically reproducible using the construction provided in Section 10 of the paper, demonstrating that a pure router will fail to select the correct expert even when optimal value information is available.

  1. Horizon-Based Expert Mismatch (Mathematical Construction):
  • Setup: Consider a system with two expert policies, $\pi_1$ and $\pi_2$, and a generation horizon $H$.
  • Condition: $\pi_1$ is the optimal policy for the first phase of generation ($t \le H/2$), while $\pi_2$ is optimal for the second phase ($t > H/2$).
  • Attack/Failure: At step $t=0$, a purely routing-based system evaluates the $Q$-values. The optimal $Q^*(x) = H$. However, the experts might yield disjoint rewards that sum to $H$ but are distributed differently (e.g., $Q^{\pi_1}(x) = H/3$ and $Q^{\pi_2}(x) = 2H/3$).
  • Result: The router, which observes the value discrepancy but cannot correct logits, incurs a mismatch error $\min_{i \in [n]} \Delta_i(x, y_{\le t}) = H/3$, where $\Delta_i = Q^*(x) - Q^{\pi_i}(x)$. This error is linear in the horizon length, ensuring that for sufficiently long sequences, the system deviates from the optimal response.
  • Reference: See Section 10, "Theoretical Discussion of Prior Token-Level Approaches," equations regarding the mismatch error $\min_{i\in[n]} \Delta_{i}(x,y_{\leq t})$.
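The arithmetic of this construction can be sketched in a few lines. This is a toy illustration, not code from the paper; the concrete value of $H$ and the reward split are hypothetical stand-ins mirroring the example above.

```python
# Toy illustration of the horizon-based expert-mismatch construction.
# H and the reward split are hypothetical example values.

H = 12  # generation horizon

# Value of committing to a single fixed expert at t = 0:
# pi_1 is optimal only for the first phase, pi_2 only for the second.
q_pi1 = H / 3
q_pi2 = 2 * H / 3

# The optimal value requires switching experts at t = H/2.
q_star = float(H)

# A pure router must select among fixed experts and cannot reshape
# their logits; even the better fixed choice leaves a gap to Q*.
deltas = [q_star - q_pi1, q_star - q_pi2]
mismatch = min(deltas)  # min_i Delta_i = H/3, linear in the horizon

print(mismatch)  # 4.0 for H = 12
```

Doubling $H$ doubles the mismatch, which is the $O(T)$ scaling claimed in the description.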
  2. Ablation Reproduction:
  • Setup: Deploy the "FusionRoute w/o complementary logits" variant (which relies solely on expert selection).
  • Input: Prompt the system with complex Coding or Instruction Following tasks (e.g., HumanEval or IfEval benchmarks).
  • Result: The system exhibits significantly higher error rates compared to the full framework, as the router correctly identifies the domain but cannot correct locally suboptimal tokens generated by the selected expert.
  • Reference: See Table 2 (Ablation Study) and Section 6.1.
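The qualitative effect of this ablation can be simulated on a toy copy task. Everything here is a hypothetical stand-in (not the paper's benchmarks or code): the selected expert occasionally prefers a locally suboptimal token, and only logit fusion can repair it.

```python
# Toy simulation contrasting a pure router (expert selection only) with
# complementary-logit fusion. Task, logits, and numbers are hypothetical.
import random

random.seed(0)
target = [random.randrange(4) for _ in range(50)]  # sequence to reproduce

def expert_logits(t):
    # The expert usually prefers the correct token, but every 5th step
    # its top choice drifts to a neighbor (a locally suboptimal logit).
    logits = [0.0] * 4
    logits[target[t]] = 2.0
    if t % 5 == 0:
        logits[(target[t] + 1) % 4] = 3.0  # local mistake
    return logits

def router_correction(t):
    # Complementary logits nudge probability mass back toward the target.
    logits = [0.0] * 4
    logits[target[t]] = 2.0
    return logits

def accuracy(use_fusion):
    hits = 0
    for t in range(len(target)):
        logits = expert_logits(t)
        if use_fusion:
            logits = [a + b for a, b in zip(logits, router_correction(t))]
        pred = max(range(4), key=lambda k: logits[k])
        hits += pred == target[t]
    return hits / len(target)

print(accuracy(False), accuracy(True))  # router-only vs. full fusion
```

The pure router inherits every local mistake of the selected expert (80% accuracy here), while fusion recovers all of them, which is the qualitative gap the ablation in Table 2 measures.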

Impact:

  • Logic Degradation: The system produces suboptimal responses in long-context generation tasks (math, coding) due to accumulated approximation errors.
  • Guardrail Bypass: If experts are utilized as safety mechanisms (e.g., a "Safety Expert" vs. a "Helpful Expert"), the identifiability failure allows the router to erroneously favor the "Helpful Expert" in adversarial scenarios where the immediate token probability does not reflect long-term safety violations.
  • Instability: Controlled-decoding approaches relying solely on external reward signals exhibit high variance and instability in expert selection.

Affected Systems:

  • Token-level multi-LLM collaboration frameworks that rely exclusively on selecting discrete tokens from fixed expert models (e.g., "Collab" by Chakraborty et al., 2025).
  • Routing-only ablations of mixture-of-experts systems where the router does not contribute to the logit distribution.
  • Systems assuming Single Policy Coverage (where the expert set is assumed to perfectly cover the optimal policy space).

Mitigation Steps:

  • Implement Complementary Logits: Augment the router to provide a complementary logit vector that is added to the selected expert's logits ($\log \pi_{\text{final}} = \log \pi_{\text{router}} + \log \pi_{\text{expert}}$). This allows the router to refine or correct the expert's distribution.
  • Complemented Direct Preference Optimization (CDPO): Train the router using a decoupled strategy where the base model parameters ($\theta_{LM}$) are updated via DPO to learn corrective behaviors, while the routing layer is frozen or trained separately on SFT data.
  • Decoupled Optimization: Do not backpropagate preference optimization gradients (DPO) to the linear routing layer to prevent the router from overfitting to preference signals and losing expert-selection capabilities.
  • Use Logit Addition over Pure Selection: Replace hard selection mechanisms with soft logit fusion to expand the effective policy class and ensure the recovery of optimal value functions.
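The first and last mitigations reduce to a soft fusion step at decode time. Below is a minimal sketch of that step under stated assumptions: the function name and toy logit values are illustrative, not the paper's implementation.

```python
# Minimal sketch of complementary-logit fusion:
# log pi_final = log pi_router + log pi_expert (up to normalization).
import numpy as np

def fuse(expert_logits, router_logits):
    """Add the router's complementary logits to the selected expert's
    logits, then renormalize into a probability distribution."""
    fused = expert_logits + router_logits
    fused = fused - fused.max()  # subtract max for numerical stability
    probs = np.exp(fused)
    return probs / probs.sum()

# Toy vocabulary of 4 tokens. The expert locally prefers token 1, but
# token 2 carries the long-term reward; the router's complementary
# logits correct the expert's local preference.
expert_logits = np.array([0.0, 2.0, 1.5, -1.0])
router_logits = np.array([0.0, -0.5, 1.0, 0.0])

p = fuse(expert_logits, router_logits)
print(p.argmax())  # fused distribution now favors token 2
```

Because the fused distribution can place mass on tokens the selected expert under-weights, the effective policy class strictly contains the pure-selection class, which is what lets the system recover optimal value functions.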

© 2026 Promptfoo. All rights reserved.