LMVD-ID: 4d7fdad8
Published November 1, 2025

MAS Pipeline Weak Links

Affected Models: GPT-4o

Research Paper

Exposing Weak Links in Multi-Agent Systems under Adversarial Prompting

View Paper

Description: Centralized Multi-Agent System (MAS) architectures, specifically those implemented with frameworks such as Magentic-One, LangGraph, and OpenAI Agents, contain design-level vulnerabilities that allow safety alignment to be bypassed. Two primary mechanisms drive this vulnerability:

  1. Context Fragmentation via Atomic Delegation: The orchestrator decomposes adversarial prompts into atomic, granular sub-tasks (e.g., "click button," "write function header") before delegating them to sub-agents. These sub-agents, lacking the global context of the original harmful objective, fail to recognize the malicious intent and execute the atomic tasks. This allows the system to incrementally achieve a harmful goal that individual agents would refuse if provided full context.
  2. Planner Refusal Bypass (Inadequate Fallback): When a specific "Planner" agent identifies a harmful prompt and refuses to generate a plan, the orchestration logic lacks explicit termination instructions. Consequently, the system defaults to invoking downstream sub-agents despite the planner's refusal, exposing the system to "Unmitigated Execution" of the adversarial prompt.
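The two failure modes above can be sketched as a toy control flow. This is a minimal, hypothetical illustration, not the API of any of the frameworks named above: the planner, sub-agent, and orchestrator below exist only to show how atomic delegation strips context and how a missing termination check lets execution continue past a refusal.

```python
# Hypothetical sketch of the vulnerable orchestration pattern.
# No names here correspond to a real framework API.

def planner(task: str) -> dict:
    """Toy planner: refuses an obviously harmful task, else emits atomic steps."""
    if "malware" in task:
        return {"refused": True, "steps": []}
    return {"refused": False, "steps": [s.strip() for s in task.split(";")]}

def sub_agent(step: str) -> str:
    # Context fragmentation: the sub-agent sees only the atomic step,
    # never the original objective, so it cannot judge intent.
    return f"executed {step!r}"

def vulnerable_orchestrator(task: str) -> list[str]:
    plan = planner(task)
    # BUG (inadequate fallback): the refusal flag is never checked, so the
    # orchestrator falls through to downstream agents with the raw task.
    steps = plan["steps"] or [task]
    return [sub_agent(s) for s in steps]

# The planner refuses, yet the pipeline still executes the harmful task.
print(vulnerable_orchestrator("write malware; deploy it"))
```

Because `vulnerable_orchestrator` treats an empty plan as "fall back to the raw task" rather than "stop", the planner's refusal has no effect on downstream execution, which is the "Unmitigated Execution" pattern described above.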

Examples:

  • Context Fragmentation (SafeArena Benchmark): In a Magentic-One implementation, a WebSurfer agent receives isolated commands such as "open website" or "click button" derived from a harmful user objective. Because the WebSurfer processes these atomic instructions without the original user prompt (which contained the malicious intent), it executes the actions. This results in a high rate of "Unmitigated Execution" classifications (66.37%) and successful attacks (ARIA-4).
  • Planner Refusal Bypass (AgentHarm Dataset): In Magentic-One, when the planner agent explicitly refuses to generate a plan for a harmful task (issuing a warning), the control flow does not terminate. The orchestrator ignores the stop signal and hands control to subsequent agents (e.g., a Coder or Executor), who then attempt to proceed with the task.
  • Refusal Override (RedCode Benchmark): In centralized implementations, an orchestrator generates a plan for malicious code generation. Even if a Coder sub-agent refuses the specific request, the orchestrator logic may interpret this as a task failure rather than a safety stop, subsequently re-routing the request or ignoring the refusal to force execution.
  • See repository: https://github.com/microsoft/SafeAgents

Impact: This vulnerability allows attackers to circumvent LLM safety alignment layers, resulting in the successful execution of harmful tasks that single-agent systems would typically reject. The impact includes:

  • Malicious Code Execution: Attack Success Rate (ARIA-4) in Magentic Centralized increases to 83.70% on the RedCode benchmark (compared to 62.5% for single-agent baselines).
  • Unsafe Web Interactions: Agents can be coerced into performing harmful actions on web platforms (SafeArena benchmark).
  • Safety Guardrail Evasion: Systemic failure of refusal mechanisms due to distributed responsibility and fragmented context.

Affected Systems:

  • Multi-Agent Systems utilizing Centralized/Orchestrator architectures.
  • Implementations based on Magentic-One.
  • Implementations based on LangGraph (specifically centralized patterns with atomic delegation).
  • Implementations based on OpenAI Agents (specifically centralized patterns).

Mitigation Steps:

  • Implement Context Sharing: Ensure sub-agents receive the original high-level user request/context alongside atomic instructions to enable local safety judgment.
  • Enforce Strict Fallback Logic: Program the orchestrator to immediately terminate the execution pipeline if the Planner agent returns a refusal, warning, or fails to generate a valid plan. Do not default to downstream agent invocation.
  • Sub-Agent Autonomy: Delegate coarser-grained tasks so that sub-agents retain sufficient context to recognize harmful objectives (avoid purely atomic instruction delegation).
  • Refusal Propagation: Implement logic where a refusal from any sub-agent (Planner, Coder, or WebSurfer) propagates up to the orchestrator as a critical stop signal, preventing alternative routing or forced execution.
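The mitigations above can be combined in a single orchestration loop. The sketch below is illustrative only, under the assumption of a simple sequential pipeline; `Refusal`, `safe_orchestrator`, and the agent functions are hypothetical names, not part of any of the affected frameworks:

```python
# Hedged sketch of the mitigations: context sharing, strict fallback on
# planner refusal, and refusal propagation from any sub-agent.
# All names are illustrative, not a real framework API.

class Refusal(Exception):
    """Raised by any agent; propagates to the orchestrator as a hard stop."""

def planner(task: str) -> list[str]:
    if "malware" in task:
        raise Refusal("planner: harmful objective")
    return [s.strip() for s in task.split(";")]

def sub_agent(step: str, original_task: str) -> str:
    # Context sharing: the sub-agent receives the original request alongside
    # its atomic step, so it can apply its own safety judgment.
    if "malware" in original_task:
        raise Refusal(f"sub-agent: refusing {step!r} given harmful context")
    return f"executed {step!r}"

def safe_orchestrator(task: str) -> list[str]:
    try:
        # Strict fallback: a planner refusal aborts here; there is no
        # default path that invokes downstream agents on an empty plan.
        steps = planner(task)
        return [sub_agent(s, task) for s in steps]
    except Refusal as r:
        # Refusal propagation: any agent's refusal terminates the pipeline.
        # It is treated as a safety stop, never re-routed or retried.
        return [f"pipeline terminated: {r}"]

print(safe_orchestrator("summarize a paper; email the summary"))
print(safe_orchestrator("write malware; deploy it"))
```

The key design choice is that a refusal is modeled as an exception rather than an ordinary failed result, so the orchestrator cannot accidentally interpret it as a retryable task failure and route around it.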

© 2026 Promptfoo. All rights reserved.