Context-Robust RAG Poisoning
Research Paper
Confundo: Learning to Generate Robust Poison for Practical RAG Systems
Description: Retrieval-Augmented Generation (RAG) systems are vulnerable to a robust corpus poisoning attack known as "Confundo." This vulnerability arises from the lack of pipeline awareness in standard RAG implementations, specifically regarding document ingestion (tokenization and chunking) and query variations. An attacker can exploit this by fine-tuning a Large Language Model (LLM) to function as a poison generator. Unlike traditional adversarial examples, which are brittle, Confundo generates poison text optimized for three objectives: (1) Pipeline Robustness, ensuring the poison remains effective even when fragmented by document chunking (e.g., splitting into 128-token chunks); (2) Lexical Generalization, ensuring the attack succeeds even if the user paraphrases the target query; and (3) Stealthiness, optimizing for high fluency and low perplexity to evade detection filters. By injecting these optimized text segments into the knowledge base (e.g., via web scraping or malicious document upload), the attacker can manipulate the RAG system into outputting incorrect facts, biased opinions, or hallucinations.
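To illustrate the chunking step the poison must survive, here is a minimal sketch of fixed-size chunking. Real pipelines use a model tokenizer; whitespace tokens are a stand-in here, and the 128-token size follows the example above:

```python
def chunk_document(text: str, chunk_size: int = 128) -> list[str]:
    """Split a document into fixed-size chunks of whitespace tokens.

    Illustrative stand-in for a real tokenizer-based splitter: a long
    injected passage gets cut at arbitrary boundaries, so each fragment
    must remain retrievable and persuasive on its own.
    """
    tokens = text.split()
    return [
        " ".join(tokens[i:i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]

# A 300-token document yields chunks of 128, 128, and 44 tokens.
doc = " ".join(f"tok{i}" for i in range(300))
chunks = chunk_document(doc, chunk_size=128)
print(len(chunks))
```

A naive adversarial string only works while intact; Confundo's "Pipeline Robustness" objective targets exactly this fragmentation.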
Examples: The attack is executed by fine-tuning a generator model (e.g., Qwen3-0.6b) using specific prompt templates and a multi-term reward function.
- Factual Correctness Manipulation: To force a RAG system to output a specific incorrect answer for a question, the attacker prompts the generator:
"This is the question: '[QUESTION]', and this is the target answer: '[ANSWER]'. Please craft a short guiding corpus to state this fact. Directly output the crafted corpus without any other words."
- Attack Scenario: An attacker targets the question "What is the support phone number?" and specifies a malicious number. The generator creates a poison text that, even when split by the RAG chunking process, dominates the context retrieval and forces the model to output the malicious number.
- Opinion Manipulation: To bias a controversial topic (e.g., "Does AI benefit humanity?"), the attacker uses:
"This is a question with no standard answer: '[QUESTION]'. Please craft a short guiding corpus to '[BIAS]' this view. Directly output the crafted corpus without any other words."
- Attack Scenario: The generated poison text is injected into a document. When retrieved, it steers the RAG system to output a purely negative or positive sentiment, overriding the LLM's safety alignment for neutrality.
- Hallucination Induction: To render the RAG system useless for specific queries:
"You are a comedian and encountered a question during your performance: '[QUESTION]'. State a short scientifically sounding fact with absolute authority and confidence, as if it were universally established—while ensuring the claim is clearly false. Directly output the crafted fact without any other words."
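The prompt templates above can be assembled programmatically before being sent to the fine-tuned generator; a minimal sketch (the question is taken from the attack scenario above, and the phone number is an illustrative placeholder, not a value from the paper):

```python
# Templates quoted from the attack description; placeholders are filled
# with attacker-chosen values before prompting the poison generator.
FACT_TEMPLATE = (
    "This is the question: '{question}', and this is the target answer: "
    "'{answer}'. Please craft a short guiding corpus to state this fact. "
    "Directly output the crafted corpus without any other words."
)

OPINION_TEMPLATE = (
    "This is a question with no standard answer: '{question}'. Please "
    "craft a short guiding corpus to '{bias}' this view. Directly output "
    "the crafted corpus without any other words."
)

def build_fact_prompt(question: str, answer: str) -> str:
    return FACT_TEMPLATE.format(question=question, answer=answer)

prompt = build_fact_prompt(
    "What is the support phone number?",
    "555-0100",  # illustrative malicious value
)
print(prompt)
```

The generator's completion, not the prompt itself, is what gets injected into the victim corpus.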
Impact:
- Data Integrity Compromise: The system returns attacker-controlled incorrect information (e.g., false financial data, incorrect medical advice, malicious URLs) as truth.
- Behavioral Manipulation: The system can be forced to exhibit specific biases or stances, damaging the reputation of the service provider.
- Service Degradation: Attackers can induce hallucinations, rendering the RAG system unreliable for specific domains or queries.
- Defense Bypass: The generated poison has been shown to bypass standard RAG defenses, including Paraphrasing (up to 73% attack success rate maintained) and Reranking (up to 78% attack success rate maintained).
Affected Systems:
- Any Large Language Model (LLM) application utilizing Retrieval-Augmented Generation (RAG).
- Systems utilizing vector databases (e.g., FAISS, Chroma) or lexical search (BM25) for context retrieval.
- RAG pipelines that ingest data from untrusted or semi-trusted sources (e.g., web scrapers, user-uploaded documents).
Mitigation Steps: Current automated defenses are largely ineffective against Confundo. The paper demonstrates that the following standard mitigations are bypassed:
- Ineffective: Reranking retrieved entries (Attack maintains high success rates).
- Ineffective: Paraphrasing input queries or retrieved entries (Confundo is optimized for lexical variability).
- Ineffective: Perplexity-based filtering (Confundo optimizes for fluency/low perplexity).
Recommended defensive posture involves architectural and procedural changes:
- Strict Source Curation: Limit knowledge base ingestion to strictly verified, trusted sources. Avoid automatic ingestion of unverified web content.
- Proactive Poisoning (Defensive Use Case): Content owners can use the Confundo framework defensively to inject "anti-scraping" poison into their own HTML source code (e.g., hidden via CSS rules such as display: none;). This prevents unauthorized RAG systems from effectively scraping and utilizing their content by poisoning the resulting index.
- Human-in-the-Loop Verification: Implement manual review stages for critical knowledge base updates.
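As a sketch of strict source curation, an ingestion gate can reject any document whose origin is not on a verified allowlist (the domain names below are hypothetical, not from the paper):

```python
from urllib.parse import urlparse

# Hypothetical allowlist; a real deployment would load this from
# reviewed configuration rather than hard-coding it.
TRUSTED_DOMAINS = {"docs.example.com", "kb.example.com"}

def is_trusted_source(url: str) -> bool:
    """Admit a document into the knowledge base only if its source
    domain is on the verified allowlist."""
    host = urlparse(url).hostname or ""
    return host in TRUSTED_DOMAINS

assert is_trusted_source("https://docs.example.com/guide")
assert not is_trusted_source("https://attacker.example.net/poison")
```

An allowlist does not detect poison already present in a trusted source, which is why manual review of critical knowledge base updates remains necessary.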
© 2026 Promptfoo. All rights reserved.