Copy-Guided Reasoning Corruption

Description: Reasoning-capable Large Language Models (LLMs) are vulnerable to a class of indirect prompt injection known as Copy-Guided Attacks (CGA). This vulnerability exploits the intrinsic behavior of reasoning models to copy tokens from the input prompt (such as variable names, function identifiers, or code snippets) into their intermediate reasoning traces (Chain-of-Thought). By embedding adversarial trigger sequences into external payloads—specifically within data the model is expected to analyze—an attacker can force the model to replicate these triggers during the decoding phase. Due to the autoregressive nature of LLM generation, the presence of these copied triggers in the immediate context biases subsequent token predictions. This allows attackers to hijack the inference process without modifying the user's explicit instructions, leading to deviations in generation length or logic.

Examples:

Denial of Service via Infinite Loop (Repetition): In a code analysis scenario (e.g., instructing the model to "Detect vulnerabilities in this contract"), an attacker modifies a function name in the target code to be a repetitive string (e.g., LOOP repeated $k$ times).
Malicious Input Payload: function LOOPLOOPLOOP...() { ... }
Mechanism: When the model attempts to reason about the function, it copies the string LOOPLOOPLOOP... into its rationale. The autoregressive probability distribution shifts to favor the token LOOP, causing the model to generate it infinitely until the context window is exhausted.
Inference Result Manipulation (Semantic Distortion): Using gradient-based optimization (Multi-Position Greedy Coordinate Gradient), an attacker generates a suffix trigger appended to a code snippet.
Scenario: A user asks the model to classify code as "vulnerable" or "safe."
Mechanism: The attacker embeds a trigger that, when copied into the reasoning trace, biases the probability of the final classification token. This can force the model to output "Safe" for a known vulnerable code snippet (False Negative) or "Vulnerable" for benign code (False Positive).
Premature Termination: An attacker embeds a trigger optimized to maximize the probability of the <eos> (End of Sequence) or </think> (End of Thought) tokens. When the model copies this trigger while analyzing the payload, it immediately halts generation or prematurely ends its reasoning process, resulting in truncated or low-quality responses.

Impact:

Denial of Service (DoS): Attackers can exhaust computational resources by trapping the model in infinite generation loops (Inference Length Manipulation).
Integrity Compromise: The model's logical conclusions can be inverted or distorted, leading to incorrect code analysis, missed vulnerabilities, or false positives (Inference Result Manipulation).
Availability Reduction: The model may be forced to terminate responses early (Premature End-of-Sequence) or trigger false refusals by biasing the output toward safety refusal templates.

Affected Systems:

Reasoning-capable Large Language Models (LLMs) that utilize Chain-of-Thought (CoT) or similar multi-step inference processes where input content is explicitly replicated in the output.
Specific examples identified in the research include DeepSeek-R1, DeepSeek-R1-Distill-Qwen-1.5B, and o4-mini.
LLM-integrated development environments (e.g., Cursor, GitHub Copilot) when processing untrusted external code repositories.

Copy-Guided Reasoning Corruption

Research Paper