Indirect Agent Privilege Exposure
Research Paper
Your Agent is More Brittle Than You Think: Uncovering Indirect Injection Vulnerabilities in Agentic LLMs
Description: Autonomous LLM agents deployed in dynamic, multi-step tool-calling environments are highly vulnerable to Indirect Prompt Injections (IPI) embedded in external content. Surface-level defensive prompts and monitoring mechanisms (such as Prompt Warning, the Sandwich Method, Spotlighting, Keyword Filtering, and LLM-as-a-Judge) consistently fail to prevent exploitation and occasionally exacerbate the vulnerability by introducing adversarial distraction. While compromised agents exhibit near-instantaneous mechanical compliance with the injected payload (bypassing multi-step deliberation) and rationalize the malicious instructions in their reasoning traces, token-level analysis reveals abnormally high decision entropy in their predictive distributions. Traditional text-filtering guardrails are entirely blind to this latent hesitation, allowing the agent to execute unauthorized tool invocations.
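The latent signal described above, abnormally high decision entropy in the model's next-token distribution, can be illustrated with a minimal sketch. The logit vectors below are synthetic stand-ins, and the function name `decision_entropy` is our own label, not terminology from the paper: a confidently aligned step concentrates probability mass on one token, while a hijacked step "hesitates" between benign and injected continuations.

```python
import numpy as np

def decision_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution implied by logits."""
    z = logits - logits.max()              # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return float(-(p * np.log(p + 1e-12)).sum())

# Confident, aligned step: one token dominates (low entropy).
peaked = np.array([10.0, 0.1, 0.1, 0.1])
# Hesitant, potentially hijacked step: near-flat distribution (high entropy).
flat = np.array([2.0, 1.9, 2.1, 2.0])

print(decision_entropy(peaked))
print(decision_entropy(flat))
```

A text filter never sees this quantity, since the sampled output token looks the same either way; only access to the predictive distribution exposes the hesitation.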
Examples: An agent is instructed to track or summarize transaction histories within a banking environment. During multi-step execution, the agent autonomously retrieves a third-party webpage or a poisoned transaction note containing an embedded "InjecAgent" or "Stealth" payload. Upon ingesting the corrupted tool result, the agent immediately diverges from the benign trajectory and invokes a malicious tool call (e.g., initiating an unauthorized fund transfer). See the experimental setups using the AgentDojo Banking suite referenced in the associated research.
Impact: Attackers can achieve remote code execution, unauthorized data exfiltration, credential theft, and irreversible state changes (e.g., unauthorized financial transfers or data deletion) by hiding dormant payloads within external data sources accessed by the agent. Successful exploitation completely hijacks the agent's operational trajectory.
Affected Systems: Agentic frameworks and tool-calling architectures utilizing open-source LLMs, including but not limited to:
- Qwen-2.5 (14B/32B) and Qwen-3 (4B/8B/14B)
- Llama-3-8B
- GLM-4-9B
- Gemma-3-12B
- Mistral-7B
Mitigation Steps:
- Deploy Representation Engineering (RepE): Implement a fine-tuning-free latent embedding analyzer to detect anomalous internal states rather than relying on surface-level text filtration.
- Intercept at the Tool-Input Position: Extract and analyze the model's hidden states at the tool-input position (where corrupted injections are first introduced) rather than the function-call position to catch and block the threat before the agent commits to the action.
- Train Logistic Probes: Train a cross-validated logistic regression classifier on paired baseline trajectories (aligned vs. hijacked hidden states) across network layers to predict hijack probability; this significantly outperforms unsupervised cosine-similarity searches.
- Implement Preemptive Circuit Breakers: Apply a threshold-based filter to the RepE classification probabilities to automatically halt execution when latent decision entropy indicates an adversarial state.
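The probe-and-circuit-breaker pipeline above can be sketched end to end: collect hidden states at the tool-input position for paired aligned and hijacked runs, fit a logistic probe, and halt when the predicted hijack probability crosses a threshold. Everything below is an illustrative assumption rather than the paper's implementation: the hidden states are synthetic Gaussian stand-ins (real deployments would read activations from the serving model's residual stream), the probe is plain gradient-descent logistic regression to avoid external ML dependencies, per-layer cross-validation is omitted, and the 0.5 threshold is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 64  # hidden-state width (illustrative)

# Paired baseline trajectories: hidden states captured at the tool-input
# position for aligned (label 0) and hijacked (label 1) runs. Synthetic here.
aligned = rng.normal(0.0, 1.0, size=(200, DIM))
hijacked = rng.normal(0.6, 1.0, size=(200, DIM))
X = np.vstack([aligned, hijacked])
y = np.concatenate([np.zeros(200), np.ones(200)])

def train_logistic_probe(X, y, lr=0.1, steps=500):
    """Logistic regression via batch gradient descent (no external ML deps)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted hijack probability
        w -= lr * (X.T @ (p - y)) / len(y)       # gradient of log-loss w.r.t. w
        b -= lr * float(np.mean(p - y))          # gradient of log-loss w.r.t. b
    return w, b

def hijack_probability(w, b, h):
    """Probe score for a single hidden state h."""
    return float(1.0 / (1.0 + np.exp(-(h @ w + b))))

def circuit_breaker(w, b, h, threshold=0.5):
    """Preemptively halt before the tool call commits when the probe fires."""
    return "HALT" if hijack_probability(w, b, h) >= threshold else "PROCEED"

w, b = train_logistic_probe(X, y)
```

Because the probe reads the hidden state at the tool-input position, the breaker trips before the function-call tokens are generated, which is what makes the halt preemptive rather than after-the-fact.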
© 2026 Promptfoo. All rights reserved.