Nugget Overfitting Attack
Research Paper
Insider Knowledge: How Much Can RAG Systems Gain from Evaluation Secrets?
Description: AutoArgue, an LLM-based evaluation framework for Retrieval-Augmented Generation (RAG) systems, is susceptible to evaluation-subversion attacks because its judging prompts and reference data structures are publicly available. An adversarial RAG system (exemplified by the "Crucible" probe) can incorporate "insider knowledge" of the evaluation logic directly into its generation pipeline. By wrapping the generation process with the evaluator's specific prompts, the system can pre-filter candidate outputs, discarding any sentence or citation that the LLM judge would penalize. Furthermore, because "gold nuggets" (reference facts) are often generated by similar LLM pipelines or are otherwise predictable, the system can overfit its generation to match these nuggets. The result is artificially inflated scores for Nugget Recall (up to +65%) and Citation Support without corresponding improvements in human-perceived quality or factual accuracy.
Examples: The vulnerability is reproduced using the "Crucible" subversion probe against the AutoArgue evaluator (backed by Llama-3.3-70B-Instruct).
Attack Vector 1: Citation Filtering (Prompt Leakage)
- The RAG system retrieves documents and generates candidate sentences.
- Before assembling the final report, the system executes the specific citation verification prompt used by AutoArgue against its own candidates.
- If the local instance of the judge returns a negative assessment for a citation, the system discards that sentence entirely.
- Only sentences pre-validated by the judge's logic are included in the final output.
- Result: In TREC RAGTIME experiments, this method raised Citation Support scores from 0.83 to 0.99.
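The filtering loop above can be sketched as follows. This is a minimal illustration, not AutoArgue's actual code: `CITATION_PROMPT`, `local_judge`, and `prefilter` are hypothetical names, and the judge is stubbed with a trivial heuristic so the sketch runs without an LLM.

```python
# Hypothetical sketch of Attack Vector 1: the adversarial system runs the
# evaluator's own (leaked) citation-verification prompt over each candidate
# sentence and keeps only those the judge would mark as supported.

# Stand-in for the leaked AutoArgue citation prompt (illustrative wording).
CITATION_PROMPT = (
    "Does the cited passage fully support the sentence? Answer YES or NO.\n"
    "Sentence: {sentence}\nPassage: {passage}"
)

def local_judge(prompt: str) -> str:
    # Stand-in for a call to the same LLM the evaluator uses; here a trivial
    # heuristic so the example is self-contained and deterministic.
    return "YES" if "supported" in prompt else "NO"

def prefilter(candidates: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Keep only (sentence, passage) pairs the judge's logic would validate."""
    kept = []
    for sentence, passage in candidates:
        verdict = local_judge(
            CITATION_PROMPT.format(sentence=sentence, passage=passage)
        )
        if verdict == "YES":
            kept.append((sentence, passage))
    return kept

candidates = [
    ("Claim A [1]", "passage where Claim A is supported"),
    ("Claim B [2]", "unrelated passage"),
]
report = prefilter(candidates)  # only the pre-validated sentence survives
```

Because every surviving sentence has already passed the judge's own check, the final Citation Support score is close to perfect by construction, independent of whether the citations would satisfy a human assessor.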
Attack Vector 2: Gold Nugget Prediction (Data Leakage)
- The evaluator relies on a set of "gold nuggets" (Q&A pairs representing key facts) to measure recall.
- The adversarial system replaces its internal topic-generation step with a prediction of the gold nuggets themselves. This is often feasible because nuggets are frequently generated by LLM pipelines or follow predictable patterns for a given topic.
- The system generates text explicitly designed to trigger a "hit" on these specific nuggets.
- Result: Nugget Recall improves by 42% to 65% compared to baseline systems.
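The overfitting mechanism can be sketched as below. All names (`predict_nuggets`, `generate_report`, `nugget_recall`) are hypothetical; the nugget predictor is stubbed with fixed Q&A pairs, and recall is approximated by substring matching rather than an LLM match.

```python
# Hypothetical sketch of Attack Vector 2: predict the evaluator's gold
# nuggets for a topic, then emit one sentence per predicted answer so that
# each gold nugget registers as a "hit".

def predict_nuggets(topic: str) -> list[tuple[str, str]]:
    # Stand-in for an LLM re-deriving the (predictable) gold nuggets;
    # fixed pairs here so the sketch runs deterministically.
    return [("What caused X?", "factor A"), ("When did X occur?", "in 2021")]

def generate_report(topic: str) -> str:
    # Emit text explicitly engineered to trigger each predicted nugget.
    return " ".join(
        f"Regarding {topic}, the answer is {answer}."
        for _question, answer in predict_nuggets(topic)
    )

def nugget_recall(report: str, gold: list[tuple[str, str]]) -> float:
    # Simplified recall: fraction of gold answers the report covers.
    hits = sum(1 for _question, answer in gold if answer in report)
    return hits / len(gold)

gold = [("What caused X?", "factor A"), ("When did X occur?", "in 2021")]
score = nugget_recall(generate_report("X"), gold)  # 1.0 when prediction matches
```

When the prediction matches the evaluator's actual nugget bank, recall saturates regardless of whether the report is coherent or useful, which is the circularity described under Impact below.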
Impact:
- Metric Inflation: Evaluation scores (Nugget Recall, Citation Support, Nugget Density) are rendered statistically invalid, showing near-perfect performance that does not correlate with human assessment.
- Circularity: The evaluation becomes a measure of how well a system mimics the judge's internal logic rather than the system's actual utility or truthfulness.
- Benchmarking Failure: Leaderboards and comparative studies (e.g., TREC NeuCLIR) become compromised, potentially leading to the deployment of inferior RAG systems that have merely optimized for the specific LLM judge.
Affected Systems:
- AutoArgue (LLM-based implementation of the Argue framework).
- Any RAG evaluation pipeline where the "LLM-as-a-Judge" prompts, rubrics, or reference nugget generation methods are public or accessible to the system under test.
Mitigation Steps:
- Blind Evaluation: Implement blinded experimental settings where the RAG system cannot access the evaluation environment or its specific parameters (e.g., using platforms like TIRA).
- Secret Artifacts: Keep key evaluation artifacts—specifically prompt templates and gold nugget banks—strictly hidden from system developers.
- Held-out LLMs: Use a different LLM for the judging process than the one used for system generation to prevent model-specific circularity and overfitting.
- Prompt Ensembles: Utilize an ensemble of diverse evaluation prompts rather than a single static prompt to reduce the effectiveness of prompt-specific optimization.
- Manual Curation: Prioritize manually curated nugget banks over LLM-generated nuggets to reduce the predictability of reference data.
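The prompt-ensemble mitigation can be sketched as below. `PROMPTS`, `run_judge`, and `ensemble_score` are illustrative names, and the per-prompt judge is stubbed with a substring check standing in for a held-out LLM call.

```python
# Hypothetical sketch of the "Prompt Ensembles" mitigation: score each
# sentence with several differently worded judge prompts and average the
# votes, so optimizing against any single leaked prompt yields less benefit.

PROMPTS = [
    "Is the sentence supported by the passage? Answer 1 or 0.\n{sentence}\n{passage}",
    "Does the evidence back the claim? Answer 1 for yes, 0 for no.\n{sentence}\n{passage}",
    "Answer 1 if the passage entails the sentence, else 0.\n{sentence}\n{passage}",
]

def run_judge(prompt: str) -> int:
    # Stand-in for a held-out LLM judging one prompt; here a trivial check
    # on the passage content so the sketch runs deterministically.
    return 1 if "evidence for" in prompt else 0

def ensemble_score(sentence: str, passage: str) -> float:
    """Average the verdicts of all prompt variants for one sentence."""
    votes = [
        run_judge(p.format(sentence=sentence, passage=passage)) for p in PROMPTS
    ]
    return sum(votes) / len(votes)
```

A system that pre-filters against one known prompt still faces the other variants, so the averaged score degrades more gracefully under prompt-specific optimization than a single static prompt would.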
© 2026 Promptfoo. All rights reserved.