Adversarial Claim Search Deception
Research Paper
DECEIVE-AFC: Adversarial Claim Attacks against Search-Enabled LLM-based Fact-Checking Systems
Description: Search-enabled Large Language Model (LLM) fact-checking systems are vulnerable to adversarial claim attacks that exploit the pipeline's reliance on claim interpretation, query formulation, and dynamic evidence retrieval. By manipulating the linguistic structure of an input claim while preserving its semantic factual intent, an attacker can induce systematic verification failures. This vulnerability stems from three specific attack surfaces:
- Search Engine Misguidance: Altering lexical features (e.g., low-frequency synonyms, keyword dispersion) to degrade the relevance of generated search queries, causing the retrieval of non-probative or noisy evidence.
- LLM Reasoning Disruption: Increasing cognitive complexity (e.g., double negation, speculative phrasing) to overwhelm the reasoning capabilities of the verifier, leading to erroneous entailment decisions even when correct evidence is present.
- Structural Complexity Escalation: Transforming atomic factual assertions into multi-hop relational problems (e.g., replacing explicit entities with indirect references), forcing the system into brittle multi-step reasoning chains prone to cascading errors. Successful exploitation results in the system returning incorrect verdicts (flipping True to False or vice versa) accompanied by coherent but factually misleading justifications.
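The three attack surfaces map onto distinct stages of the verification pipeline. A minimal sketch of such a pipeline is below; all function names are hypothetical placeholders, not an actual DECEIVE-AFC or vendor API, and serve only to show where each surface sits:

```python
# Hypothetical sketch of a search-enabled fact-checking pipeline,
# annotated with the stage each attack surface targets.

def formulate_query(claim: str) -> str:
    """Stage 1 (Search Engine Misguidance target): build a search query.
    A naive implementation extracts salient keywords, so low-frequency
    synonyms or keywords dispersed across clauses weaken the query."""
    stopwords = {"the", "a", "an", "is", "was", "that", "of", "in", "to", "not"}
    keywords = [w for w in claim.lower().split() if w not in stopwords]
    return " ".join(keywords[:8])  # truncation amplifies keyword dispersion

def retrieve_evidence(query: str) -> list[str]:
    """Stage 2: dynamic evidence retrieval from the open web (stubbed)."""
    return [f"document matching: {query}"]

def verify(claim: str, evidence: list[str]) -> str:
    """Stage 3 (Reasoning Disruption / Structural Escalation target):
    in a real system an LLM judges entailment here; double negation and
    multi-hop references inflate the reasoning load at this step."""
    return "True" if evidence else "NEI"  # NEI = not enough information

def fact_check(claim: str) -> str:
    return verify(claim, retrieve_evidence(formulate_query(claim)))
```

Because each stage consumes the previous stage's output, a perturbation that degrades query formulation also corrupts everything downstream, which is why errors cascade.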
Examples: The following transformation strategies effectively trigger the vulnerability. These can be applied to benign claims found in the MOCHEG dataset:
- Search Misguidance via Keyword Dispersion:
  - Technique: Restructuring a claim to spread core keywords across multiple clauses or diluting them with connective phrases to lower keyword concentration in generated search queries.
  - Effect: The search engine fails to prioritize relevant documents due to weakened keyword prominence.
- Search Misguidance via Non-standard Entity Referencing:
  - Technique: Replacing specific named entities (e.g., "Joe Biden") with indirect descriptive phrases or role-based identifiers.
  - Effect: The system generates ambiguous search queries that fail to retrieve specific, time-relevant evidence.
- Reasoning Disruption via Double Negation:
  - Technique: Rewriting the claim using logically redundant double negation structures (e.g., changing "X occurred" to "It is not the case that X failed to occur").
  - Effect: Increases the reasoning load on the LLM, causing it to misinterpret the relationship between the claim and the retrieved evidence.
- Structural Escalation via Compound Relational Statements:
  - Technique: Reformulating simple atomic facts into compound statements involving multiple dependencies.
  - Effect: Forces the system to aggregate information across multiple hops; an error in any single retrieval or reasoning hop breaks the verification chain.
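Two of the strategies above can be illustrated with simple string rewrites. These are hypothetical toy implementations; in practice an attacker would use an LLM to produce fluent rewrites rather than literal substitutions:

```python
# Toy illustrations (not the paper's method) of two transformation
# strategies: indirect entity referencing and double negation.

def indirect_reference(claim: str, entity: str, description: str) -> str:
    """Search Misguidance: replace a named entity with a role-based
    description, yielding ambiguous, less specific search queries."""
    return claim.replace(entity, description)

def double_negate(claim: str, verb: str, antonym: str) -> str:
    """Reasoning Disruption: wrap the claim in a logically redundant
    double negation (e.g., 'signed' -> negated 'declined to sign')."""
    rewritten = claim.replace(verb, antonym)
    return f"It is not the case that {rewritten[0].lower()}{rewritten[1:]}"

original = "Joe Biden signed the bill in 2022."
attack = double_negate(
    indirect_reference(original, "Joe Biden", "the 46th U.S. president"),
    "signed",
    "declined to sign",
)
# attack: "It is not the case that the 46th U.S. president declined to
# sign the bill in 2022." -- semantically equivalent to the original,
# but harder to search for and harder to verify.
```

Note that both rewrites preserve the claim's truth value, which is what makes the resulting verdict flip an integrity failure rather than a mere input-quality issue.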
Impact:
- Integrity Violation: The automated fact-checking system produces incorrect verification verdicts (accuracy reduced from ~79% to ~53% in tested environments).
- Misinformation Amplification: The system generates plausible-sounding but erroneous justifications, potentially lending authoritative credence to false claims.
- Availability: In some attack vectors (e.g., LEET perturbations on specific models like HiSS), the system may default to refusal predictions, effectively denying service.
Affected Systems:
- Search-enabled LLM-based Automated Fact-Checking (AFC) systems that dynamically retrieve evidence from the open web.
- Specific implementations shown to be vulnerable include:
  - HiSS (Hierarchical Step-by-Step prompting)
  - LEMMA (LLM with External Knowledge Augmentation)
  - DEFAME (Modular, zero-shot search-enabled verification)
Mitigation Steps:
- Verification Model Fine-Tuning: Fine-tune base LLMs on adversarial datasets containing paired benign claims and their adversarial variants (specifically those utilizing synonym substitution and syntactic complexity) to align the model against common perturbation patterns.
- Interpretation and Retrieval Validation: Implement adversarial-aware correction mechanisms at intermediate pipeline stages. This includes validating generated search queries before execution and filtering retrieved evidence for credibility to prevent the ingestion of noisy data caused by misguided queries.
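The query-validation portion of the second mitigation can be sketched as a lightweight pre-execution check. This is a hypothetical illustration, not the paper's implementation: it flags generated queries whose terms have drifted too far from the claim, a symptom of keyword dispersion or non-standard entity references:

```python
# Hypothetical sketch of pre-execution query validation: reject search
# queries whose terms overlap too little with the claim being checked.

def keyword_set(text: str) -> set[str]:
    stopwords = {"the", "a", "an", "is", "was", "that", "of",
                 "in", "to", "not", "it", "case"}
    return {w.strip(".,").lower() for w in text.split()} - stopwords

def query_overlap(claim: str, query: str) -> float:
    """Fraction of query terms that actually appear in the claim."""
    claim_kw, query_kw = keyword_set(claim), keyword_set(query)
    return len(claim_kw & query_kw) / max(len(query_kw), 1)

def validate_query(claim: str, query: str, threshold: float = 0.5) -> bool:
    """Gate query execution on sufficient claim/query keyword overlap;
    failing queries would be re-formulated before retrieval."""
    return query_overlap(claim, query) >= threshold
```

A real deployment would pair this with evidence-side filtering (source credibility, retrieval score thresholds) so that both misguided queries and noisy retrieved documents are caught before the entailment step.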
© 2026 Promptfoo. All rights reserved.