LMVD-ID: e0bc28ed
Published July 1, 2025

Black-Box RAG Rank Hijack

Research Paper

DeRAG: Black-box Adversarial Attacks on Multiple Retrieval-Augmented Generation Applications via Prompt Injection

Description: Retrieval-Augmented Generation (RAG) systems built on dense (e.g., BERT-based) or sparse (e.g., BM25) retrievers are vulnerable to black-box adversarial prompt injection. Using a gradient-free attack based on Differential Evolution (DE) optimization, referred to as DeRAG, an attacker can generate short adversarial suffixes (typically ≤ 5 tokens). When appended to a user query, these suffixes manipulate the retriever's ranking so that a specific malicious or irrelevant target document is promoted into the top-k results (often to Rank 1). The suffix is optimized to minimize the distance between the query embedding and the target document embedding in the retriever's latent space, effectively bypassing semantic relevance checks without any access to the model's gradients or internal weights.
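
At its core the attack treats the retriever as a black-box scoring oracle: the only feedback the optimizer needs is the rank of the target document when a candidate suffix is appended to the query. The sketch below illustrates that objective for a dense retriever; the model name, helper names, and the precomputed corpus_embs matrix are illustrative assumptions rather than details taken from the paper.

  # Minimal sketch of the black-box fitness signal (hypothetical helper names;
  # any dense retriever exposing query/document embeddings would work).
  import numpy as np
  from sentence_transformers import SentenceTransformer

  retriever = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in dense retriever

  def target_rank(query, suffix_tokens, corpus_embs, target_idx):
      """Rank (1 = best) of the target document for the suffixed query."""
      adv_query = query + " " + " ".join(suffix_tokens)
      q = retriever.encode(adv_query, normalize_embeddings=True)
      scores = corpus_embs @ q          # cosine similarity (embeddings pre-normalized)
      order = np.argsort(-scores)       # document indices, highest score first
      return int(np.where(order == target_idx)[0][0]) + 1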

Examples: The vulnerability is reproduced by optimizing a suffix that forces a specific negative (irrelevant) document into the top-ranked retrieval results, and thus into the LLM's context window.

  1. Attack Setup:
  • Query: "What is the capital of France?"
  • Target Document (Irrelevant/False): "Madrid is the capital of Spain." (Document $X_3$)
  • Optimization: The DeRAG algorithm initializes a population of candidate suffixes (e.g., [unused186], wash, candidate). It performs mutation and crossover operations, evaluating fitness based on the retrieval rank of the target document (a simplified loop is sketched after this list).
  • Result: The algorithm evolves tokens (e.g., replacing wash with phantom) until the target document $X_3$ becomes the Top-1 result for the modified query.
  2. Reproduction Code:
  • Full implementation and reproduction scripts are available at: https://github.com/pen9rum/Rag_attack_DeRag
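
The following is a simplified sketch of the evolutionary search described in step 1, building on the target_rank helper sketched under the Description. The discrete mutation/crossover operators, population size, and hyperparameters are illustrative assumptions; the authors' implementation in the repository above may differ.

  # Simplified differential-evolution loop over suffix token IDs (illustrative only).
  import random

  def evolve_suffix(query, corpus_embs, target_idx, vocab,
                    suffix_len=5, pop_size=20, generations=50, F=0.5, CR=0.7):
      # Each individual is a list of vocabulary indices, one per suffix position.
      pop = [[random.randrange(len(vocab)) for _ in range(suffix_len)]
             for _ in range(pop_size)]
      fitness = [target_rank(query, [vocab[i] for i in ind], corpus_embs, target_idx)
                 for ind in pop]

      for _ in range(generations):
          for i in range(pop_size):
              a, b, c = random.sample([j for j in range(pop_size) if j != i], 3)
              trial = []
              for d in range(suffix_len):
                  if random.random() < CR:
                      # Mutation in index space, wrapped back into the vocabulary range.
                      trial.append(int(pop[a][d] + F * (pop[b][d] - pop[c][d])) % len(vocab))
                  else:
                      trial.append(pop[i][d])
              trial_fit = target_rank(query, [vocab[t] for t in trial],
                                      corpus_embs, target_idx)
              if trial_fit <= fitness[i]:   # lower rank is better (1 = Top-1)
                  pop[i], fitness[i] = trial, trial_fit

      best = min(range(pop_size), key=lambda i: fitness[i])
      return [vocab[t] for t in pop[best]], fitness[best]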

Impact:

  • Retrieval Hijacking: Attackers can force the system to retrieve specific malicious documents, overriding legitimate information sources.
  • Context Poisoning: The downstream LLM is fed incorrect or malicious context, leading to hallucinations, misinformation generation, or biased answers.
  • Safety Bypass: By controlling the context, attackers can circumvent LLM safety guardrails that rely on factual grounding.
  • Stealth: The generated suffixes are short and can be optimized for readability (low perplexity), making them difficult to detect via standard anomaly detection or perplexity filters.

Affected Systems:

  • RAG pipelines utilizing dense retrievers (e.g., BERT-base-uncased, RoBERTa, DPR, Contriever).
  • RAG pipelines utilizing sparse retrievers (e.g., BM25).
  • RAG applications over BEIR benchmark corpora (MS MARCO, SciFact, FiQA, FEVER), on which the attack was demonstrated.

Mitigation Steps:

  • Prompt Precision Defense: Implement strict validation on input prompts to strip or neutralize unrelated suffix tokens (a sketch follows this list).
  • Embedding Regularization: Apply regularization techniques to the retriever's embedding space to reduce sensitivity to small token perturbations.
  • Anomaly Detection: Deploy advanced anomaly detection mechanisms capable of identifying adversarial input patterns, though standard perplexity filters may be insufficient against readability-optimized attacks.
  • Top-k Robustness: Do not rely solely on the Top-1 retrieved document; analyze a broader set of retrieved contexts to dilute the influence of a single injected document.
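
As one concrete illustration of the prompt-precision idea, the sketch below drops trailing query tokens that are semantically unrelated to the query head. The encoder choice, similarity heuristic, and threshold are assumptions made for illustration, not a defense prescribed by the paper.

  # Hypothetical suffix-stripping heuristic (assumed encoder and threshold).
  from sentence_transformers import SentenceTransformer

  encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works here

  def strip_suspect_suffix(query, max_suffix_len=5, sim_threshold=0.35):
      """Remove trailing tokens that look unrelated to the rest of the query."""
      tokens = query.split()
      if len(tokens) <= max_suffix_len:
          return query
      head = " ".join(tokens[:-max_suffix_len])
      head_emb = encoder.encode(head, normalize_embeddings=True)
      kept = tokens[:-max_suffix_len]
      for tok in tokens[-max_suffix_len:]:
          tok_emb = encoder.encode(tok, normalize_embeddings=True)
          if float(head_emb @ tok_emb) >= sim_threshold:
              kept.append(tok)   # token is plausibly related to the query; keep it
      return " ".join(kept)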
