LMVD-ID: 42c7786b
Published April 1, 2026

No-Prompt Reasoning Hijack

Affected Models: GPT-4o, GPT-5, Claude 3.5, Llama 3.1 70B

Research Paper

Stop Fixating on Prompts: Reasoning Hijacking and Constraint Tightening for Red-Teaming LLM Agents

View Paper

Description: An implicit reasoning hijacking vulnerability exists in Retrieval-Augmented Generation (RAG) and LLM-based agent frameworks. An attacker with write access to an agent's external memory or knowledge base can inject adversarially optimized malicious instances that trigger jailbreaks without modifying the user's input prompt in any way. The attack uses a shadow model to extract high-contribution subword tokens from anticipated benign user queries via log-probability changes and step-wise KL divergence. The attacker then injects a poisoned instance whose embedding is optimized through constraint tightening, a joint optimization of Particularity, Clustering, Separability, and Margin losses, so that the trigger maps to a unique, compact region of the semantic space. When a user submits a completely normal, unperturbed query, the agent retrieves the poisoned instance, which dynamically hijacks the agent's reasoning trajectory and reward function and causes it to execute unintended actions or malicious tool invocations.
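The four constraint-tightening terms are named but not defined in this entry. A minimal sketch of how they might combine over cosine similarities is shown below; the term weights `w`, the hinge `margin`, and the exact sign conventions are assumptions for illustration, not the paper's definitions:

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def joint_loss(poison, triggers, benigns, margin=0.2, w=(1.0, 1.0, 1.0, 1.0)):
    """Toy version of the four constraint-tightening terms.

    poison   -- embedding of the injected instance (list of floats)
    triggers -- embeddings of high-contribution trigger tokens
    benigns  -- embeddings of ordinary knowledge-base entries
    """
    sim_t = [cosine(poison, t) for t in triggers]
    sim_b = [cosine(poison, b) for b in benigns]
    # Particularity: push the poisoned instance away from benign entries.
    particularity = sum(sim_b) / len(sim_b)
    # Clustering: keep the poisoned instance compact around its triggers.
    clustering = -sum(sim_t) / len(sim_t)
    # Separability: widen the gap between trigger and benign similarity.
    separability = -(max(sim_t) - max(sim_b))
    # Margin: hinge forcing the poison to outrank every benign entry.
    margin_loss = max(0.0, margin - max(sim_t) + max(sim_b))
    terms = (particularity, clustering, separability, margin_loss)
    return sum(wi * ti for wi, ti in zip(w, terms))
```

Minimizing this objective drives the poisoned embedding toward the trigger region and away from the benign cluster, so the poisoned record wins retrieval only for the anticipated queries.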

Examples: To reproduce the JailAgent reasoning hijacking attack:

  1. Identify anticipated benign user queries for the target agent (e.g., standard clinical questions for EHRAgent or video analysis queries for VideoAgent).
  2. Using a local shadow LLM, compute per-token log-probability changes and step-wise KL divergence to extract high-contribution syntactic phrases and tokens from the benign queries.
  3. Synthesize a poisoned data triplet consisting of the context, a malicious response (incorporating an opposing/hazardous action or tool misuse), and a benign response.
  4. Optimize the embedding of the poisoned instance by jointly minimizing Particularity Loss (pushing it away from benign clusters), Clustering Loss (maintaining token compactness), Separability Loss (maximizing hit ratio for the poisoned index), and Margin Loss (ensuring higher similarity to the toxic entry than benign entries).
  5. Inject the optimized poisoned record into the agent's RAG database.
  6. Submit the original, unmodified benign query. The agent will retrieve the poisoned record and execute the malicious reasoning step. (See the JailAgent repository and associated datasets: EgoSchema, NExT-QA, HotpotQA, StrategyQA, and MIMIC-III).

Impact: Successful exploitation allows an attacker to achieve long-term task hijacking, planning manipulation, and tool misuse. Because the attack relies entirely on the retrieval of poisoned memory instances rather than explicit prompt injection, the user's input remains completely benign. This allows the attack to easily evade standard prompt-level defense mechanisms, such as Perplexity (PPL) filters, while degrading the agent's structural reasoning capabilities.

Affected Systems: LLM-based agents relying on external knowledge bases, long-term memory, or RAG architectures for multi-step reasoning, planning, and tool invocation (e.g., ReAct-style agents, VideoAgent, EHRAgent).

Mitigation Steps:

  • Implement prompt perturbation defenses such as RA-LLM, which generates variants of the input via random deletion and evaluates them for consistency. This disrupts the high-contribution token mapping and was shown to significantly reduce the attack's effectiveness.
  • Restrict and strictly authenticate write access to the agent's external memory stores, vector databases, and knowledge bases to prevent the injection of optimized poisoned instances.
  • Apply anomaly detection and content filtering directly to the context retrieved from memory prior to injecting it into the LLM's reasoning window, rather than solely filtering the user's input prompt.
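The RA-LLM defense in the first bullet can be sketched as a random-deletion consistency vote. This is a simplification: the variant count, drop rate, and vote threshold below are illustrative defaults, and `is_harmful` is a placeholder for the aligned model's own refusal check:

```python
import random

def ra_llm_flag(tokens, is_harmful, n_variants=8, drop_rate=0.3,
                threshold=0.5, seed=0):
    """Flag an input as adversarial if enough randomly perturbed
    variants are judged harmful (RA-LLM-style random deletion)."""
    rng = random.Random(seed)
    votes = 0
    for _ in range(n_variants):
        # Randomly delete a fraction of tokens; this disrupts the
        # high-contribution token mapping an optimized trigger relies on.
        variant = [t for t in tokens if rng.random() > drop_rate]
        if is_harmful(variant):
            votes += 1
    return votes / n_variants >= threshold
```

Because the poisoned instance's retrieval depends on a compact set of high-contribution tokens, deleting even a few of them at random tends to break the trigger mapping, which is why this class of defense was reported to reduce the attack's effectiveness.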

© 2026 Promptfoo. All rights reserved.