LLM Ranker Jailbreak

Description: LLM-based document re-rankers utilizing decoder-only and Mixture-of-Experts (MoE) architectures are vulnerable to candidate-embedded prompt injections during multi-document comparison tasks. By embedding Decision Objective Hijacking (DOH) or Decision Criteria Hijacking (DCH) prompts into candidate documents, attackers can manipulate the model's preference to artificially elevate an injected document to the top rank. The vulnerability exploits the models' instruction-following capabilities and exhibits a distinct scaling vulnerability: larger, more capable LLMs (e.g., 70B parameter models) are significantly more susceptible than smaller counterparts. Furthermore, the vulnerability heavily exploits recency bias in causal decoders during setwise and listwise ranking, where back-of-passage (end-of-document) injections are substantially more disruptive than front-placed ones.

Examples: Specific Decision Objective Hijacking (DOH) and Decision Criteria Hijacking (DCH) adversarial prompt payloads can be found in the project repository. See repository at https://github.com/ielab/LLM-Ranker-Attack.

Impact: Successful injections severely degrade the end-to-end operational quality of the retrieval pipeline (averaging a 55.2% relative decline in nDCG@10). Adversaries can execute targeted search engine optimization (SEO) manipulation, displace gold-standard relevant documents with non-relevant or malicious content, and systematically degrade the trustworthiness of LLM-powered Information Retrieval (IR) and RAG systems.

Affected Systems:

LLM-based Information Retrieval and re-ranking pipelines.
Systems utilizing decoder-only or MoE Large Language Models (including Qwen3, Gemma-3, LLaMA-3.3-70B, and GPT-4.1-mini).
Rankers operating under pairwise, setwise, or listwise multi-document comparison paradigms.
Pipelines processing domains with short document lengths (e.g., entity retrieval), which are particularly susceptible due to textual dilution effects.

Mitigation Steps:

Deploy Encoder-Decoder architectures (e.g., the Flan-T5 family) for re-ranking stages. Bidirectional encoding globally integrates semantics and heavily mitigates the recency bias exploited by end-of-passage injections, reducing Attack Success Rates (ASR) to negligible levels (~3%).
Avoid relying on model scale for security; larger decoder-only models exhibit increased vulnerability to these specific hijacking attacks compared to smaller models.
Sanitize or truncate candidate documents specifically at the tail end to disrupt back-of-passage injection placements, which are empirically proven to be the most damaging in setwise and listwise paradigms.

LLM Ranker Jailbreak

Research Paper