Mobile Agent Visual Spoofing

Description: Mobile Large Language Model (LLM) agents operating under the "Screen-as-Interface" paradigm are vulnerable to visual indirect prompt injection and state desynchronization. Agents that rely on unstructured visual data (screenshots) and Accessibility Service APIs to perceive the environment lack a mechanism to distinguish between trusted system UI elements and untrusted content (e.g., web pages, emails, or malicious overlays). An attacker can inject visual cues, fake notifications, or hidden text instructions into the display execution context. The agent's multimodal planner interprets these adversarial inputs as authoritative state changes or high-priority user instructions, causing the agent to deviate from the user's intent. This allows for "confused deputy" attacks where the agent utilizes its elevated system privileges ("God Mode") to execute unauthorized actions, exfiltrate sensitive data across applications, or interact with malicious domains.

Examples:

Visual Phishing and Persistent Autonomy: A user instructs the agent to process an email. The email contains a visually deceptive link labeled "Open in Sheets" which is actually a phishing URL. The agent, relying on visual cues, attempts to click the link. If deep-linking fails, the agent's autonomous planning loop persists by copying the raw URL and manually opening it in an external browser, navigating to the unverified malicious domain without user consent.
State Desynchronization via Overlay Injection: A user asks the agent to "check recent emails." A malicious background application detects the agent's activity and injects a transparent or imitation overlay (e.g., a fake "Network Lag Detected - Click Refresh" popup) over the legitimate Gmail interface. The agent perceives this overlay as the ground truth state. When the agent attempts to click the "Refresh" button, the click coordinates are intercepted by the malicious app to trigger a hidden privileged action, such as confirming a paid subscription or enabling background data collection permissions.
Malicious Memo Completion: A user asks the agent to "fill in the incomplete parts" of a memo containing a prohibited chemical synthesis process. Despite the memo image containing a warning text ("Not for legal use"), the visual encoder downweights this non-salient text. The agent treats the request as a functional blank-filling operation and generates the hazardous chemical synthesis instructions, bypassing safety alignment filters that would typically catch text-only prompts.

Impact:

Unauthorized Privilege Escalation: Attackers can leverage the agent's extensive permissions (INJECT_EVENTS, READ_FRAME_BUFFER) to control the device.
Cross-App Data Pivoting: Sensitive data (e.g., passcodes, contacts) can be extracted from a secure app and unwittingly pasted into a public or malicious sink (e.g., a social media post or phishing form) by the agent.
Financial Loss: Unintended execution of financial transactions or subscription confirmations via UI hijacking.
Safety Bypass: Circumvention of LLM safety alignment regarding illegal or toxic content generation.

Affected Systems:

Doubao Mobile Assistant (Doubao-Standard and Doubao-Pro variants)
Mobile agents relying on Accessibility Services (A11y) or Android Debug Bridge (ADB) for screen scraping and event injection (e.g., AutoGLM, Mobile-Agent).

Mitigation Steps:

Transition to Structured Intent: Replace unstructured GUI scraping with a Hub-and-Spoke topology where a System Agent (SA) orchestrates tasks via structured, deterministic APIs exposed by sandboxed App Agents (AAs).
Implement an Agent Kernel: Deploy a TEE-backed OS module to mediate all interactions between the System Agent and App Agents, serving as a policy-enforcement choke point.
Cryptographic Identity Binding: Enforce Agent Identity Cards (AIC) and a Global Agent Registry (GAR) to strictly authenticate agents and prevent app impersonation.
Semantic Input Sanitization: Implement a Semantic Firewall that performs origin checks, sensitive data redaction, and prompt isolation (wrapping external observations in non-executable XML tags) before data reaches the LLM context.
Just-in-Time (JIT) Permissions: Replace static "God Mode" permissions with dynamic, intent-driven permission negotiation that validates requests against a static capability boundary.
Critical Node Interception: Enforce "Monitor-on-Access" for high-stakes APIs (financial, data egress), requiring explicit user confirmation or passing a Runtime Alignment Validator check before execution.

Mobile Agent Visual Spoofing

Research Paper