Pre-Task Persuasion Propagation
Research Paper: Persuasion Propagation in LLM Agents
Description: A behavioral vulnerability exists in Large Language Model (LLM) agents where task-irrelevant persuasion introduced in prior interactions or system contexts induces a persistent "belief state" that alters downstream task execution. This phenomenon, termed "Persuasion Propagation," occurs when an agent adopts a stance on a controversial topic (e.g., politics, privacy) that is semantically unrelated to its primary function (e.g., coding, medical research). This adopted stance acts as a latent conditioning variable, causing significant behavioral drift in multi-step execution trajectories. Specifically, agents influenced by such persuasion exhibit altered exploration patterns, such as conducting fewer web searches, visiting fewer unique domains, or changing code debugging iterations, even when the final output remains superficially compliant with safety guidelines. This vulnerability bypasses standard output-based safeguards because the behavioral changes manifest in the process of information gathering and reasoning rather than in explicit policy violations.
Examples:
- Medical Research Degradation via Political Persuasion:
  - An attacker interacts with a medical research agent and uses persuasive arguments (e.g., "Individuals must take responsibility for online privacy without excessive government mandates") to convince the agent to adopt a specific libertarian stance.
  - The attacker subsequently tasks the agent with summarizing treatment options for type 2 diabetes—a task unrelated to privacy.
  - Due to Persuasion Propagation, the agent's execution trace shows a marked reduction in information seeking: it visits significantly fewer unique sources and fails to engage with recent clinical updates (e.g., newer drug classes), relying instead on a narrow set of authoritative pages, even though the prompt did not ask for a concise summary.
- Exploration Suppression via Belief Prefilling:
  - An attacker injects a belief statement into the agent's prefilled context or conversation history (e.g., explicitly stating belief in a specific controversial claim).
  - The agent is tasked with an open-ended web research task.
  - Compared to a neutral baseline, the belief-prefilled agent executes 26.9% fewer searches and visits 16.9% fewer unique sources, effectively creating a "filter bubble" and reducing the thoroughness of the research without the user's knowledge.
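The baseline-versus-prefilled comparison above can be sketched in a few lines. This is an illustrative harness only, not code from the paper: the trace format and the tool names `web_search` and `visit_page` are hypothetical stand-ins for whatever tool-call log an agent framework actually emits.

```python
# Illustrative sketch (hypothetical trace format): quantify exploration
# suppression by comparing tool-call traces from a neutral baseline run
# against a belief-prefilled run of the same task.

def exploration_stats(trace):
    """Return (number of searches, number of unique domains visited)."""
    searches = sum(1 for step in trace if step["tool"] == "web_search")
    domains = {step["url"].split("/")[2]          # netloc of each visited URL
               for step in trace if step["tool"] == "visit_page"}
    return searches, len(domains)

def relative_drop(baseline, conditioned):
    """Fractional reduction of the conditioned run relative to baseline."""
    return (baseline - conditioned) / baseline if baseline else 0.0

neutral_trace = [
    {"tool": "web_search"}, {"tool": "web_search"},
    {"tool": "visit_page", "url": "https://a.example/x"},
    {"tool": "visit_page", "url": "https://b.example/y"},
]
prefilled_trace = [
    {"tool": "web_search"},
    {"tool": "visit_page", "url": "https://a.example/x"},
]

n_searches, n_domains = exploration_stats(neutral_trace)
p_searches, p_domains = exploration_stats(prefilled_trace)
print(relative_drop(n_searches, p_searches))  # 0.5 -> 50% fewer searches
print(relative_drop(n_domains, p_domains))    # 0.5 -> 50% fewer unique domains
```

On real traces, the same two counters would be averaged over many trials per condition to produce figures like the 26.9%/16.9% reductions reported above.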
- Code Generation Behavioral Drift:
  - An agent is exposed to logical appeals regarding a social issue.
  - The agent is tasked with a Python coding problem requiring iterative debugging (e.g., from the TACO/KodCode datasets).
  - The agent exhibits drift in its "Time-and-Revision Score" (TRS) and "Edit Volatility Score" (EVS), altering its debugging cadence and the granularity of its code patches compared to its baseline persona, potentially impacting code robustness or efficiency.
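The paper's exact TRS/EVS formulas are not reproduced here; the following is a simplified proxy under stated assumptions. It treats "edit volatility" as the spread (population standard deviation) of per-revision patch sizes, using a naive line-set diff, which is enough to surface changes in debugging cadence between runs.

```python
# Simplified proxy (not the paper's TRS/EVS definitions): measure how
# erratic an agent's debugging revisions are via the spread of patch sizes.
from statistics import pstdev

def patch_sizes(revisions):
    """Changed-line counts between successive revisions (naive line-set diff)."""
    sizes = []
    for prev, curr in zip(revisions, revisions[1:]):
        a, b = prev.splitlines(), curr.splitlines()
        sizes.append(len(set(a) ^ set(b)))  # lines in exactly one revision
    return sizes

def volatility(revisions):
    """Population std of patch sizes; 0.0 when fewer than two patches exist."""
    sizes = patch_sizes(revisions)
    return pstdev(sizes) if len(sizes) > 1 else 0.0

stable  = ["a\nb", "a\nb\nc", "a\nb\nc\nd"]          # steady 1-line patches
erratic = ["a\nb", "a\nb\nc\nd\ne\nf", "a"]          # 4-line then 5-line patch
print(volatility(stable), volatility(erratic))        # 0.0 0.5
```

A persona-conditioned run whose volatility diverges from the agent's neutral baseline would be flagged for review, even if both runs ultimately pass the task's test suite.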
Impact:
- Reduced Task Reliability: Agents may prematurely terminate information seeking or reduce the diversity of sources consulted, leading to incomplete or biased reports.
- Latent Behavioral Conditioning: Malicious actors can subtly degrade agent performance or bias decision-making processes in a way that is difficult to detect through standard output inspections (e.g., the answer looks correct but misses critical nuance available in ignored sources).
- Audit Evasion: Because the persuasive content is task-irrelevant and the output contains no prohibited content, this manipulation bypasses standard safety filters designed to catch prompt injections or toxic generation.
Affected Systems:
- Autonomous LLM Agents utilizing multi-step execution loops (e.g., AutoGen frameworks).
- Agents powered by susceptible backbone models (e.g., GPT-4, Mistral-Nemo, LLaMA-3.1) that maintain conversational state or utilize prefilled context windows.
- Systems involved in long-horizon tasks such as web browsing, research synthesis, and iterative coding.
Mitigation Steps:
- Context Isolation: Enforce strict separation of context between distinct tasks. Reinitialize agent memory and state between trials to prevent cross-trial belief carryover.
- Trace-Based Auditing: Implement process-level monitoring rather than relying solely on output evaluation. Audit execution traces for anomalies in tool use frequency, search query volume, and source diversity.
- Neutral Belief Conditioning: Explicitly condition agents with a neutral belief state in the system prompt prior to task execution to counteract potential latent persuasion effects from user interaction.
- Behavioral Distribution Monitoring: Establish baseline behavioral metrics (e.g., average searches per task, average code revisions) for specific agent personas and alert on significant deviations (drift) even if task success rates remain constant.
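The trace-auditing and distribution-monitoring steps above can be sketched as a simple z-score alert. This is an assumed design, not a specific product feature: the metric (searches per task), baseline sample, and threshold are all illustrative.

```python
# Minimal sketch of behavioral distribution monitoring (assumed design):
# build a per-persona baseline for a process metric such as searches per
# task, then flag runs that deviate sharply even if task success holds.
from statistics import mean, pstdev

def build_baseline(samples):
    """Summarize historical runs of one persona as (mean, population std)."""
    return mean(samples), pstdev(samples)

def drifted(value, baseline, threshold=3.0):
    """True when a run's metric sits more than `threshold` sigmas from baseline."""
    mu, sigma = baseline
    if sigma == 0:
        return value != mu           # degenerate baseline: any change is drift
    return abs(value - mu) / sigma > threshold

searches_per_task = [14, 16, 15, 17, 15, 16, 14, 15]  # illustrative baseline runs
baseline = build_baseline(searches_per_task)
print(drifted(11, baseline))  # True  -> sharp reduction, candidate persuasion effect
print(drifted(15, baseline))  # False -> within normal variation
```

In practice the same monitor would track several metrics per persona (search volume, unique domains, revision counts) and feed alerts into the trace-based auditing pipeline rather than blocking runs outright.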
© 2026 Promptfoo. All rights reserved.