Adaptive Agent Tool Injection

Description: Agentic LLMs integrated with external data services (e.g., Model Context Protocol, MCP) are vulnerable to Adaptive Indirect Prompt Injection (IPI) attacks. When an agent queries external servers, attackers can inject malicious payloads into the retrieved content to hijack the agent's reasoning process and force the execution of high-authority tools. Unlike traditional static prompt injections, this vulnerability dynamically exploits the agent's internal logic audit. By using Markovian transition modeling, the attacker predicts the agent's expected benign tool invocation and automatically selects a malicious tool with high semantic similarity to the user's ongoing task context. This semantic alignment prevents the reasoning LLM (Chain-of-Thought) from flagging the instruction as a "Security Risk" or "Unrelated Information," allowing the injected payload to successfully bypass internal safety filters and dynamic defense layers.

Examples: The paper does not provide verbatim adversarial prompts, but demonstrates the attack logic via semantic tool substitution:

Attack Scenario: In a "Coffee Delivery" task trajectory, an attacker intercepts the MCP server response. Instead of injecting a jarring, out-of-context command (e.g., forcing a "Delete File" tool call which the LLM would reject as unrelated), the attacker utilizes the transition matrix to select a semantically congruent malicious tool, such as "Transfer Money". The attack framework synthesizes a prompt mapping the coffee transaction to the unauthorized financial transfer tool, tricking the agent into executing it under the guise of completing the original task.
For the complete repository of 3,691 vulnerable agent trajectories, 277 high-authority attack tools, and generated adaptive prompts, see the IPI-3k dataset introduced in the paper.

Impact: Successful exploitation allows third-party attackers or malicious MCP server controllers to induce unauthorized behaviors via the agent's toolsets. This can lead to severe consequences including arbitrary data exfiltration, financial loss, and host system damage. The attack effectively halves the efficacy of state-of-the-art IPI detectors (e.g., MELON, Pi-Detector) and degrades overall benign agent utility by a factor of 1.78.

Affected Systems:

Tool-augmented LLM agents and ReAct frameworks interfacing with external data environments (e.g., web retrieval, databases).
Agents leveraging the Model Context Protocol (MCP) to interact with third-party servers.
Underlying foundation models evaluated as vulnerable include commercial models (GPT-4.1, Gemini-2.5-Flash, DeepSeek-R1) and open-source models (Qwen3-8B, LLaMA-3.1-8B, Mistral-8B). Open-source models exhibited notably higher vulnerability (up to 58.1% Attack Success Rate).

Mitigation Steps: While the paper notes that existing pre-detection (input-level) and post-detection (output-level) defenses are insufficiently robust against adaptive tool selection, the architectural findings suggest the following mitigations:

Principle of Least Privilege for MCPs: Enforce strict permission boundaries on external MCP servers. High-authority tools (e.g., file deletion, financial transactions) should require an out-of-band "human-in-the-loop" authorization prompt, preventing automated execution driven by retrieved external data.
Semantic Boundary Enforcement: Implement strict input validation that isolates external data payloads from execution contexts. Do not allow context parsed from third-party APIs to natively dictate the functional arguments of high-risk tools.
Adversarial Fine-Tuning: Continually update the agent's system prompts and reasoning filters using adaptive datasets (such as IPI-3k) to train the model to recognize semantically aligned but unauthorized tool transitions.

Adaptive Agent Tool Injection

Research Paper