Query-Agnostic Retrieval Poisoning
Research Paper
" Someone Hid It": Query-Agnostic Black-Box Attacks on LLM-Based Retrieval
Description: A vulnerability in Large Language Model-based Retrieval (LLMR) systems allows attackers to hide specific documents from retrieval (e.g., in RAG pipelines or search engines) by appending a small number of adversarially crafted, query-agnostic tokens. The attack operates in a complete black-box setting: it requires no knowledge of the victim's queries, the target retrieval model's parameters, or the underlying document corpus. Using Document-Query Adversarial (DQ-A) learning on the word-embedding layer of a zero-shot surrogate model, the injected tokens shift the targeted document's embedding outside its natural topic cluster, making the document effectively invisible to downstream similarity matching across a wide range of disparate LLM embedding models.
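The core mechanism can be illustrated with a toy NumPy sketch. Mean-pooling stands in for a real transformer encoder, and the suffix is injected as raw vectors; the actual DQ-A attack instead learns discrete tokens against a surrogate model's word-embedding layer, so all names and scales below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 64

# Toy document: 30 token embeddings clustered around a topic centroid.
doc_tokens = rng.normal(loc=0.5, scale=0.1, size=(30, DIM))
centroid = doc_tokens.mean(axis=0)

def embed(tokens: np.ndarray) -> np.ndarray:
    """Mean-pooled document embedding (stand-in for a real encoder)."""
    return tokens.mean(axis=0)

def cos(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Query-agnostic poisoning: append a handful of "tokens" whose embeddings
# point in an off-topic direction, dragging the pooled document embedding
# out of its topic cluster without referencing any query.
off_topic = rng.normal(size=DIM)
suffix = np.tile(10.0 * off_topic, (5, 1))  # 5 adversarial tokens

clean_sim = cos(embed(doc_tokens), centroid)
poisoned_sim = cos(embed(np.vstack([doc_tokens, suffix])), centroid)
print(f"similarity to topic centroid: clean={clean_sim:.2f}, "
      f"poisoned={poisoned_sim:.2f}")
```

Because the shift targets the document's own topic cluster rather than any specific query, the same short suffix degrades retrieval for every query that would normally match the document.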
Examples: See the dataset and experimental code repository at https://github.com/JetRichardLee/DQA-Learning for specific transferable adversarial token suffixes generated via DQ-A Learning.
Impact: An unprivileged attacker with edit access to a shared knowledge base (e.g., a Wikipedia contributor or forum commenter) can stealthily manipulate documents to censor information, suppress competitor products in recommendation systems, or degrade the integrity of RAG applications by withholding critical context. The attack achieves a significant reduction in retrieval metrics, causing up to an 8% absolute drop (roughly a 20-30% relative drop) in Recall@50 on high-performing embedding models.
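For reference, Recall@k (the metric cited above) is the fraction of relevant documents that survive into the top-k ranking, so each hidden document translates directly into lost recall. A minimal sketch, with hypothetical document IDs and rankings:

```python
def recall_at_k(ranked_ids, relevant_ids, k=50):
    """Fraction of relevant documents appearing in the top-k results."""
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

relevant = {"doc_2", "doc_7"}

# Clean corpus: both relevant documents land in the top 50.
clean_ranking = ["doc_7", "doc_3", "doc_2"] + [f"doc_{i}" for i in range(100, 147)]

# Poisoned corpus: the suffixed doc_2 falls out of the top 50 entirely.
poisoned_ranking = ["doc_7", "doc_3"] + [f"doc_{i}" for i in range(100, 148)]

print(recall_at_k(clean_ranking, relevant))     # 1.0
print(recall_at_k(poisoned_ranking, relevant))  # 0.5
```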
Affected Systems: LLM-based Retrieval (LLMR), Dense Information Retrievers (IR), and Agent Memory Retrieval systems utilizing standard transformer-based embedding models. The attack demonstrates zero-shot transferability and has been validated against:
- Qwen-1.5-7B
- SFR-Embedding-Mistral
- E5-Mistral-7B-Instruct
- Embedding-Gemma-300M
- Jina-Embeddings-v3
- Granite-Embedding-r2
Mitigation Steps:
- Adversarial Post-Training: Apply post-training techniques specifically designed to defend against document injection attacks to improve embedding robustness (observed natively in models like Qwen3-Embedding-0.6B, which exhibited resistance to the attack).
- Input Sanitization and Monitoring: Restrict or flag unauthorized, anomalous, or high-entropy token suffixes appended to documents by untrusted users in shared corpus environments.
- Robust Retrieval Architectures: Move beyond isolated post-retrieval detection and develop inherently robust retrieval embeddings that do not drastically shift out of topic clusters due to the inclusion of minor, fixed-length token perturbations.
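One cheap way to approximate the input-sanitization step is a character-entropy heuristic over the document tail, since adversarial token suffixes tend to read as high-entropy symbol soup compared with natural prose. A sketch; the `tail_chars` window and `threshold` are illustrative and would need per-corpus tuning:

```python
import math
from collections import Counter

def shannon_entropy(text: str) -> float:
    """Bits per character of the empirical character distribution."""
    counts = Counter(text)
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def has_suspicious_suffix(doc: str, tail_chars: int = 80,
                          threshold: float = 4.8) -> bool:
    """Heuristic filter: flag documents whose trailing characters show
    unusually high entropy (English prose sits around 4 bits/char)."""
    tail = doc[-tail_chars:]
    return len(tail) > 0 and shannon_entropy(tail) > threshold
```

In practice this heuristic would be paired with perplexity scoring under a small language model, since an attacker can trade entropy for fluency when crafting the suffix.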
© 2026 Promptfoo. All rights reserved.