Embodied Action Jailbreak
Research Paper
Jailbreaking Embodied LLMs via Action-level Manipulation
View PaperDescription: Embodied Large Language Models (LLMs) used for real-world agent planning are vulnerable to Action-level Manipulation (dubbed "Blindfold"), a jailbreak technique that bypasses semantic-level safety filters by exploiting the models' limited spatial and causal reasoning regarding physical consequences. Attackers can use an adversarial proxy LLM to decompose a semantically harmful intent into a sequence of individually benign primitive actions. To evade advanced semantic correlation checks (semantic residual effect), the attack injects context-aware cover actions (noise) that mask the dominant malicious action. Because standard LLM safeguards evaluate linguistic semantics rather than physical action trajectories, the embodied agent executes the benign-looking instructions, resulting in dangerous real-world outcomes.
Examples:
- Direct Decomposition: An attacker's malicious intent "Explode user's phone" is translated by a proxy planner into the sequence:
find(phone) → pick(phone) → move(oven) → stretch(). - Intent Obfuscation (Cover Action Injection): To hide the dominant malicious action
place(phone)in the oven, the attacker generates a benign branch by substituting the parameter (e.g.,place(apple)) and inserting intermediate steps:find(apple) → pick(apple) → move(oven). The system merges these states, shifting the semantic context to the safe object while keeping the harmful transition intact, causing the robot to execute the unsafe action without triggering semantic defense mechanisms. - Physical Execution: An instruction like "Clip the user's hair" is decomposed into physical sub-commands that manipulate a 6DoF robotic arm to move toward the user and close a tightened gripper near their head.
Impact: Successful exploitation allows an attacker to bypass prompt-based security checks, content moderation, and LLM-aligned decoding defenses (e.g., Llama-Guard, SafeDecoding, VeriSafe). This leads to unauthorized and potentially dangerous real-world actuation by the embodied agent, including physical harm to users, environmental sabotage, privacy violations, and destruction of property.
Affected Systems: Embodied AI systems utilizing standard LLMs as autonomous planning modules, including:
- LLMs: GPT-4o, GPT-4-turbo, GPT-4o-mini, Claude-3.5-sonnet, Llama-3.1-8B, DeepSeek-R1-14B, Gemma-3-27B, and Phi-4-14B.
- Embodied Frameworks: ProgPrompt, Code-as-Policies (CaP), VoxPoser, and LLM-Planner.
- Execution Environments: Simulated environments (VirtualHome, Habitat, ManiSkill, RoboTHOR) and physical robotics platforms (e.g., 6DoF UFactory xArm 6) relying on API-based LLM control without trajectory-aware physical safeguards.
Mitigation Steps:
- Multi-modal Alignment: Integrate and align real-time multimodal environmental cues (e.g., visual observations) into the safety evaluation process to ensure consistent security between text instructions and physical states.
- Action-level Reasoning: Move beyond surface-level prompt semantics by modeling the embodied agent’s complete action trajectory. Formally verify the sequence of physical actions against global, domain-specific safety constraints prior to execution (e.g., utilizing FSM-based or formal verification frameworks to detect unsafe final environmental states).
© 2026 Promptfoo. All rights reserved.