Personalized Agent Double Agent
Research Paper
From Assistant to Double Agent: Formalizing and Benchmarking Attacks on OpenClaw for Personalized Local AI Agent
View PaperDescription: OpenClaw is vulnerable to Indirect Prompt Injection (IPI), Tool-Return Manipulation, and Persistent Memory Poisoning. The agent incorporates untrusted external content (e.g., fetched web pages) and external tool outputs directly into its observation stream without sufficient isolation. An attacker can embed malicious payloads into these external channels to hijack the agent's planning and execution trace. This allows the attacker to silently trigger high-privilege actions via OpenClaw's Skills registry (translating into unauthorized TypeScript asynchronous operations), extract private assets from Short-Term Memory (STM) and Long-Term Memory (LTM), and overwrite LTM markers to persistently compromise the agent's behavior across future, unrelated sessions.
Examples: Specific payload configurations and the end-to-end testing harness are available in the Personalized Agent Security Bench (PASB) dataset. See repository at https://github.com/AstorYH/PASB. In the paper's evaluation, attackers successfully targeted 131 high-impact tools from OpenClaw's public Skills registry (e.g., messaging, transactions) and achieved persistent LTM modification, verified via the OpenClaw file system.
Impact: Compromise of the agent's action chain, leading to unauthorized high-privilege tool invocation (e.g., permission modifications, financial transactions), arbitrary exfiltration of private user context, and persistent cross-session behavioral manipulation via poisoned long-term memory stores.
Affected Systems: OpenClaw personalized local AI agents.
Mitigation Steps:
- Instruction Prevention Defense: Apply strict predefined system instructions that explicitly prohibit the agent from executing specific high-risk actions, regardless of user or external input.
- Delimiter Defense: Enclose all ingested external content, tool returns, and retrieved memory within robust delimiters to explicitly segregate system instructions from untrusted data.
- Sandwich Defense: Surround user prompts and fetched context with protective system instructions at both the beginning and end of the context window to prevent payload execution.
- Note: Research indicates that these prompt-layer defenses significantly reduce the Attack Success Rate (e.g., dropping from 66.8% to 22.0% in combined attacks) but do not fully eliminate the vulnerability. Architectural isolation of the tool execution environment and human-in-the-loop authorization for high-privilege asynchronous operations are strictly recommended.
© 2026 Promptfoo. All rights reserved.