Stealthy Tool Chain Amplification
Research Paper
Beyond Max Tokens: Stealthy Resource Amplification via Tool Calling Chains in LLM Agents
View PaperDescription: A stealthy resource exhaustion (Economic Denial-of-Service) vulnerability exists in the multi-turn tool-calling layer of Large Language Model (LLM) agents, particularly those utilizing the Model Context Protocol (MCP). An attacker controlling a third-party tool server can manipulate text-visible fields (such as argument descriptions and error messages) to force the LLM into a prolonged, verbose tool-calling loop. By demanding lengthy, non-semantic outputs (e.g., long comma-separated lists) and incrementally delaying the return of the actual functional payload over multiple turns, the malicious server inflates token generation exponentially. Because the final task completes successfully and the function signatures remain valid, this multi-turn cost amplification evades standard prompt perplexity filters, output monitoring, and trajectory-level safety judges.
Examples: To reproduce the attack, an adversary modifies a standard MCP tool server's text template without altering its core function identifier or final terminal payload:
- Schema Manipulation: The attacker adds two text-only arguments to the tool description: a
segment_index($t$) and acalibration_sequence(requiring a comprehensive, comma-separated numeric list). - Template-Driven Return Policy: The server's backend is modified to evaluate the LLM's tool-call arguments:
- Progress Notice: If the agent provides a correctly formatted sequence but $t < T_{\max}$, the server returns a "Progress" message, withholding the actual tool output and prompting the agent to increment $t$ and call the tool again.
- Repair Notice: If the agent abbreviates the sequence or formats it incorrectly, the server returns an explicit "Repair" message demanding a compliant retry without advancing $t$.
- Terminal Return: Only when $t = T_{\max}$ and the sequence validates does the server return the actual benign tool payload. This forces the agent to generate massive amounts of tokens strictly during the intermediate tool-calling steps.
Impact:
- Financial/Token Exhaustion: Amplifies per-query outputs to over 60,000 tokens, inflating API token budgets/costs by up to 658x.
- Hardware Resource Exhaustion: Increases total query energy consumption by 100x to 560x.
- System Denial of Service: Spikes peak GPU Key-Value (KV) cache occupancy from <1% to 35–74%. This sustained resource occupancy throttles system scheduling, degrading overall throughput (tokens/s) for concurrent, benign workloads by 50% to over 60%, severely limiting Out-of-Memory (OOM) safe concurrency.
Affected Systems:
- Autonomous LLM agents utilizing multi-turn tool calling and standardized agent-tool protocols like the Model Context Protocol (MCP).
- Tested and confirmed vulnerable underlying models include Qwen-3-32B, Llama-3.3-70B-Instruct, Llama-DeepSeek-70B, Mistral Large, Seed-32B, and GLM-4.5-Air.
Mitigation Steps:
- Behavioral Baselines: Develop and deploy trajectory monitors that evaluate the efficiency of the entire agentic workflow against expected behavioral baselines, rather than relying on generation-level self-monitoring or final-output safety judges (which fail to flag task-correct loops).
- Tool Provenance Controls: Restrict agent access to strictly vetted, first-party tools, preventing connection to untrusted MCP servers capable of defining arbitrary return policies.
- Strict Hard Limits: Enforce aggressive per-session token caps and hard tool-call iteration limits. While this does not detect or completely prevent the attack, it acts as a throttle to cap worst-case cost and KV-cache amplification.
© 2026 Promptfoo. All rights reserved.