LMVD-ID: 278a699e
Published January 1, 2026

LLM Router Rerouting

Affected Models: GPT-4, GPT-4o, GPT-5, Llama 3 8B, Mixtral 8x7B

Research Paper

RerouteGuard: Understanding and Mitigating Adversarial Risks for LLM Routing


Description: LLM routing systems are vulnerable to adversarial rerouting attacks where malicious triggers prepended to user queries manipulate the router's model-selection mechanism. Because LLM routers function as classifiers evaluating query complexity to balance computational cost and response quality, an attacker can craft adversarial prefixes that distort the query's latent semantic representation. This exploits the router's decision boundaries, forcing the system to misclassify the input and redirect it to a targeted, sub-optimal language model.

Example: Prepending the phrase "Respond quickly" to a harmful query artificially alters the routing features, manipulating the router into selecting a faster, cheaper, but less secure candidate model (e.g., switching from a heavily aligned GPT-4-class model to a locally deployed, weakly aligned model) and producing a jailbreak response.
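The mechanics of the example above can be illustrated with a toy score-based router. The scoring function here is a hypothetical stand-in (a crude word-length complexity proxy, not any production router), used only to show how an innocuous-looking prefix can push a query across the router's decision boundary:

```python
# Toy sketch of adversarial rerouting against a score-based router.
# The "router" is a hypothetical stand-in that scores query complexity
# and sends high-scoring queries to a strong, aligned model. Real routers
# use learned classifiers; this only illustrates the attack surface.

def route(query: str, threshold: float = 0.5) -> str:
    """Hypothetical router: route complex-looking queries to the strong model."""
    words = query.split()
    # crude complexity proxy: average word length, squashed into [0, 1]
    score = min(sum(len(w) for w in words) / (6 * max(len(words), 1)), 1.0)
    # a prefix signalling urgency drags the score toward the cheap model,
    # mimicking how an adversarial trigger distorts learned routing features
    if query.lower().startswith("respond quickly"):
        score *= 0.3
    return "strong-aligned-model" if score >= threshold else "weak-local-model"

harmful = "Explain in detail how to synthesize a restricted compound"
print(route(harmful))                        # -> strong-aligned-model
print(route("Respond quickly. " + harmful))  # -> weak-local-model
```

In a real attack the trigger shifts the query's latent representation rather than tripping a hard-coded rule, but the effect on the decision boundary is the same.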

Impact:

  • Cost Escalation (Resource Exhaustion): Attackers force simple queries to be routed to expensive, high-capacity models, artificially inflating API overhead and wasting computational resources.
  • Quality Hijacking: Attackers force complex queries to be routed to weaker, smaller-parameter models, degrading the application's overall response quality and task performance.
  • Safety Bypass: Attackers force malicious or harmful queries to be routed to weaker or unfiltered candidate models that lack advanced safety guardrails, enabling unmitigated jailbreaks.

Affected Systems: Multi-model AI architectures utilizing LLM routers for dynamic model selection, specifically systems relying on:

  • Classification-based Routers (e.g., fine-tuned BERT classifiers)
  • Scoring-based Routers (e.g., Causal LLMs evaluating "win rates")
  • Matrix Factorization (MF) scoring functions
  • Similarity-Weighted (SW) Ranking mechanisms (e.g., RouteLLM implementations)
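To make the last routing style above concrete, here is a hedged sketch of a similarity-weighted (SW) ranking router, loosely in the spirit of RouteLLM: the strong model's win rate on a new query is estimated as a similarity-weighted average over labelled historical queries. The bag-of-words "embedding" and the history data are toy assumptions, not the paper's implementation:

```python
# Hedged sketch of a similarity-weighted (SW) ranking router.
# Historical queries carry an observed strong-model win rate; a new
# query's predicted win rate is a similarity-weighted average of these.
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; real routers use learned encoders."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# (query, observed strong-model win rate) pairs -- fabricated toy history
history = [
    ("prove the convergence of gradient descent", 0.90),
    ("write a haiku about spring", 0.20),
    ("derive the closed form of the fibonacci sequence", 0.85),
    ("what is the capital of france", 0.10),
]

def predicted_win_rate(query):
    q = embed(query)
    weighted = [(cosine(q, embed(h)), wr) for h, wr in history]
    total = sum(w for w, _ in weighted)
    return sum(w * wr for w, wr in weighted) / total if total else 0.5

def route(query, threshold=0.5):
    return "strong" if predicted_win_rate(query) >= threshold else "weak"

print(route("prove the convergence of newton's method"))  # -> strong
print(route("what is the capital of spain"))              # -> weak
```

Because the routing decision depends entirely on similarity to reference queries, an adversarial prefix that shifts the query's representation toward a different neighborhood directly controls which model is selected.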

Mitigation Steps:

  • Deploy Pre-Routing Filtering Guardrails: Implement an embedding-based anomaly detection filter (such as RerouteGuard) prior to the LLM router to identify and drop adversarial prefixes.
  • Contrastive Learning Detection: Train a dual-encoder siamese network using supervised contrastive learning to distinguish between the latent semantic representations of benign queries and adversarial rerouting prompts.
  • Dynamic Reference Pairing: For each incoming user query, dynamically construct K pairs against a baseline of known-benign historical queries. Use the siamese network to classify each pair and apply majority voting to filter out queries exhibiting adversarial semantic shifts before they reach the router.
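The reference-pairing and majority-voting step can be sketched as follows. The `pair_is_benign` function below is a stand-in for the trained siamese contrastive classifier (here simply a threshold on Euclidean distance between hypothetical embedding vectors), so this is an illustration of the filtering logic, not the paper's model:

```python
# Minimal sketch of dynamic reference pairing with majority voting.
# `pair_is_benign` stands in for the trained siamese classifier:
# a small latent distance to a benign reference counts as a benign vote.
import random

def pair_is_benign(query_vec, reference_vec, tau=1.0):
    """Stand-in for the siamese classifier: small latent distance -> benign."""
    dist = sum((a - b) ** 2 for a, b in zip(query_vec, reference_vec)) ** 0.5
    return dist <= tau

def filter_query(query_vec, history_vecs, k=5, rng=random):
    """Pair the query with K sampled benign historical queries; majority vote."""
    refs = rng.sample(history_vecs, min(k, len(history_vecs)))
    votes = sum(pair_is_benign(query_vec, r) for r in refs)
    return votes > len(refs) // 2  # True -> forward to the router

# toy latent space: benign queries cluster near the origin
history = [[0.1 * i, -0.1 * i] for i in range(10)]
benign = [0.2, -0.3]
adversarial = [5.0, 4.0]  # adversarial prefix shifts the representation

print(filter_query(benign, history))       # -> True  (reaches the router)
print(filter_query(adversarial, history))  # -> False (dropped)
```

Sampling fresh reference pairs per query makes the filter harder to probe than a single fixed decision boundary, since an attacker cannot know in advance which references the vote will use.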

© 2026 Promptfoo. All rights reserved.