Adversarial Context Reasoning Brittleness
Research Paper
Learning Robust Reasoning through Guided Adversarial Self-Play
Description: Large Reasoning Models (LRMs) optimized via Reinforcement Learning from Verifiable Rewards (RLVR) are vulnerable to context pollution in their reasoning traces. An attacker can induce catastrophic reasoning failure by injecting locally coherent but logically or mathematically corrupted snippets into the model's Chain-of-Thought (CoT) or conditioning context. Because standard RLVR optimizes for final-answer correctness strictly under clean conditioning, the models treat the visible trajectory as authoritative. Instead of detecting the inconsistency and recovering, the models blindly follow the misleading context to an incorrect final answer, even on tasks they solve perfectly under normal conditions. This vulnerability exhibits inverse scaling, where stronger reasoning models are more susceptible to context pollution.
Examples:
An attacker provides a partial solution trace to the model but alters a single intermediate mathematical statement to be objectively false. For example, modifying a correct intermediate step from 10 - 1 = 9 to 10 - 1 = 8, or changing a formula from (n-1)d to (n+1)d within the reasoning prefix. The model fails to recognize the injected error, accepts the corrupted calculation as ground truth, and bases the remainder of its reasoning on the flawed premise to produce an incorrect final answer.
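The attack described above can be sketched in a few lines. The snippet below is illustrative, not the paper's actual attack code: `pollute_trace` is a hypothetical helper that corrupts the first subtraction in a reasoning prefix with an off-by-one result, mirroring the `10 - 1 = 9` → `10 - 1 = 8` example, while leaving the rest of the trace intact so the corruption stays locally coherent.

```python
import re

def pollute_trace(trace: str) -> str:
    """Corrupt the first statement of the form 'a - b = c' in a reasoning
    trace by shifting its result down by one, leaving everything else
    untouched so the trace still reads as plausible arithmetic."""
    def corrupt(m: re.Match) -> str:
        a, b = int(m.group(1)), int(m.group(2))
        return f"{a} - {b} = {a - b - 1}"  # objectively false intermediate step
    return re.sub(r"(\d+)\s*-\s*(\d+)\s*=\s*\d+", corrupt, trace, count=1)

clean = "Step 1: 10 - 1 = 9. Step 2: double the result to get 18."
polluted = pollute_trace(clean)
# polluted: "Step 1: 10 - 1 = 8. Step 2: double the result to get 18."
```

A model that treats the visible prefix as authoritative will carry the false `8` forward instead of flagging the inconsistency.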
Impact: Attackers can reliably manipulate the outputs of reasoning models and multi-agent systems by introducing minor, localized corruptions into prompts, partial solution traces, or collaborative contexts. This causes a complete collapse in reasoning reliability and final-answer correctness without requiring obvious jailbreaks.
Affected Systems: Large Reasoning Models (LRMs) trained primarily via RLVR on clean data, specifically including:
- DeepSeek-R1
- DeepSeek-R1-Distill-Qwen-1.5B
- DeepScaleR-1.5B
- Qwen3-4B
- Qwen3-8B
Mitigation Steps:
- Implement Guided Adversarial Self-Play (GASP) during the model's post-training pipeline to explicitly train detect-and-repair capabilities.
- Train the model via a two-role adversarial game with only verifiable terminal rewards: a "polluter" that generates locally coherent corruptions, and an "agent" that learns to diagnose and recover from them.
- Apply in-distribution repair guidance (a lightweight imitation term on self-generated repair snippets) to increase the probability of successful recoveries early in training and avoid representational drift.
- Incorporate fallible contexts (e.g., corrupted CoT, misleading partial solutions) into standard RL pipelines rather than relying exclusively on clean conditioning.
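The two-role game above can be sketched as a single training episode. This is a minimal toy illustration of the reward structure, assuming hypothetical stand-ins for the learned policies (`polluter_corrupt`, `agent_solve`) and a final-answer verifier; the real method trains both roles with RL rather than using fixed functions.

```python
def polluter_corrupt(prefix: str) -> str:
    """Polluter role: inject a locally coherent but false step.
    Toy stand-in; the actual polluter is a learned policy."""
    return prefix.replace("10 - 1 = 9", "10 - 1 = 8")

def agent_solve(context: str) -> str:
    """Agent role: should diagnose the corruption and recover.
    Toy stand-in that repairs the known corruption and answers."""
    repaired = context.replace("10 - 1 = 8", "10 - 1 = 9")
    return repaired + " Doubling gives the answer: 18"

def verify(completion: str, gold: str) -> bool:
    """Verifiable terminal reward: final-answer correctness only."""
    return completion.strip().endswith(gold)

# One episode of the adversarial game.
clean_prefix = "Step 1: 10 - 1 = 9."
polluted = polluter_corrupt(clean_prefix)        # polluter's move
completion = agent_solve(polluted)               # agent's recovery attempt

agent_reward = 1.0 if verify(completion, "18") else 0.0
polluter_reward = 1.0 - agent_reward             # polluter is paid for failures
```

The in-distribution repair guidance would add a lightweight imitation loss on the agent's own successful repair snippets, increasing early recovery rates without pulling the policy off-distribution.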
© 2026 Promptfoo. All rights reserved.