Agent Tool Execution Jailbreak

Description: A multi-step tool execution vulnerability exists in Large Language Model (LLM) agents utilizing the Model Context Protocol (MCP) or similar tool-calling frameworks. Safety guardrails in aligned LLMs typically evaluate static, single-turn text generation. Attackers can bypass these text-centric guardrails by supplying adversarial prompts that force the agent into a complex planning sequence. The agent is manipulated into executing a trajectory of seemingly benign individual tool invocations that collectively achieve a harmful objective, converting prompt-level jailbreaks into realized environmental actions.

Examples:

Filesystem Configuration Manipulation: An attacker uses a refusal-suppression prompt, instructing the agent to act as a network engineer during an emergency drill. The agent sequentially executes list_allowed_directories, search_files, read_text_file, and edit_file to locate emergency_sinkhole_active.conf and append attacker-controlled domains across multiple formats (dnsmasq, Unbound, hosts), compromising availability.
Cross-Server Chaining (Playwright + Filesystem): Framed as a "security containment drill," the attacker prompts the agent to simulate a credential-stealing script. The agent uses Filesystem tools to create a directory and write a functional input-capturing JavaScript file (write_file), then seamlessly switches to Playwright tools (browser_navigate) to load a target page and deploy the payload.
Authorized Messaging Abuse (Slack): Operating under an "Authority Manipulation" prompt, the agent uses channels_list to identify a company-wide announcement channel and conversations_add_message to post a deceptive phishing notice instructing employees to reset passwords at an attacker-controlled portal.

Impact: Successful exploitation allows attackers to perform unauthorized real-world actions constrained only by the permissions of the tools exposed to the agent. Impacts include arbitrary code execution, denial of service, sensitive data exfiltration, deployment of malicious infrastructure, and automated spear-phishing, bypassing the LLM's inherent safety alignment.

Affected Systems: Autonomous LLM agents integrated with multi-step tool-calling frameworks (e.g., Model Context Protocol). Models empirically found vulnerable to this trajectory-based manipulation include GPT-5.2, Gemini-3-Pro, Qwen3.5, and GLM-5 when connected to operational environments such as CodeExecutor, Slack, Gmail, Playwright, and local filesystems.

Mitigation Steps:

Enforce Least Privilege: Restrict the scopes and permissions of tools exposed to the agent to the absolute minimum required for specific tasks.
Human-in-the-Loop (HITL): Require explicit user confirmation before the execution of high-risk tool operations (e.g., writing files, sending emails, executing arbitrary code, or modifying configurations).
Input Validation & Sandboxing: Implement strict input validation on all parameters passed to external tools and enforce execution sandboxing to prevent state contamination or lateral movement.
Trajectory-Aware Monitoring: Deploy security mechanisms capable of evaluating the holistic intent of an ongoing multi-step execution sequence, rather than relying solely on text-based prompt-and-response filters.

Agent Tool Execution Jailbreak

Research Paper