LMVD-ID: 4d7fdad8
Published November 1, 2025
MAS Pipeline Weak Links
Affected Models: GPT-4o
Research Paper
Exposing Weak Links in Multi-Agent Systems under Adversarial Prompting
Description: Centralized Multi-Agent System (MAS) architectures, specifically those implemented with frameworks such as Magentic-One, LangGraph, and OpenAI Agents, contain design-level vulnerabilities that allow safety alignment to be bypassed. Two primary mechanisms drive this vulnerability:
- Context Fragmentation via Atomic Delegation: The orchestrator decomposes adversarial prompts into atomic, granular sub-tasks (e.g., "click button," "write function header") before delegating them to sub-agents. These sub-agents, lacking the global context of the original harmful objective, fail to recognize the malicious intent and execute the atomic tasks. This allows the system to incrementally achieve a harmful goal that individual agents would refuse if provided full context.
- Planner Refusal Bypass (Inadequate Fallback): When a specific "Planner" agent identifies a harmful prompt and refuses to generate a plan, the orchestration logic lacks explicit termination instructions. Consequently, the system defaults to invoking downstream sub-agents despite the planner's refusal, exposing the system to "Unmitigated Execution" of the adversarial prompt.
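The two mechanisms above can be illustrated with a toy simulation. This is a minimal sketch, not code from Magentic-One, LangGraph, or OpenAI Agents: `ToyAgent`, `toy_planner`, the `REFUSED` marker, and the `"HARMFUL"` keyword check are all hypothetical stand-ins for a model's safety judgment, and the fallback step in `fallthrough_orchestrator` is a simplification of how a real orchestrator might continue after a planner refusal.

```python
from dataclasses import dataclass
from typing import List, Optional

REFUSAL = "REFUSED"  # hypothetical marker for a safety refusal


@dataclass
class ToyAgent:
    """Hypothetical sub-agent: refuses only if the text it sees looks harmful."""
    name: str

    def run(self, visible_text: str) -> str:
        # Stand-in for a model's safety judgment; a real agent would call an LLM.
        if "HARMFUL" in visible_text:
            return REFUSAL
        return f"{self.name} executed: {visible_text}"


def toy_planner(user_prompt: str) -> Optional[List[str]]:
    """Returns None to model a planner that refuses to produce a plan."""
    if "HARMFUL" in user_prompt:
        return None
    return ["open website", "click button"]


def fragmenting_orchestrator(user_prompt: str, atomic_steps: List[str],
                             agent: ToyAgent) -> List[str]:
    """Context fragmentation: each step is delegated WITHOUT the user prompt."""
    return [agent.run(step) for step in atomic_steps]


def fallthrough_orchestrator(user_prompt: str, agent: ToyAgent) -> List[str]:
    """Planner refusal bypass: a missing plan does not stop the pipeline."""
    plan = toy_planner(user_prompt)
    if plan is None:
        # No termination branch: control falls through to a default step.
        plan = ["attempt the task"]
    return [agent.run(step) for step in plan]


prompt = "HARMFUL objective (placeholder)"
surfer = ToyAgent("WebSurfer")

direct = surfer.run(prompt)  # agent sees full context and refuses
fragmented = fragmenting_orchestrator(
    prompt, ["open website", "click button"], surfer
)  # agent sees only atomic steps and executes both
bypassed = fallthrough_orchestrator(prompt, ToyAgent("Coder"))
```

In this sketch the same agent that refuses the full prompt executes every atomic step, and the planner's refusal (`None`) has no effect on whether a downstream agent is invoked.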
Examples:
- Context Fragmentation (SafeArena Benchmark): In a Magentic-One implementation, a WebSurfer agent receives isolated commands such as "open website" or "click button" derived from a harmful user objective. Because the WebSurfer processes these atomic instructions without the original user prompt (which contained the malicious intent), it executes the actions. This results in high "Unmitigated-Execution" classifications (66.37%) and successful attacks (ARIA-4).
- Planner Refusal Bypass (AgentHarm Dataset): In Magentic-One, when the planner agent explicitly refuses to generate a plan for a harmful task (issuing a warning), the control flow does not terminate. The orchestrator ignores the stop signal and hands control to subsequent agents (e.g., a Coder or Executor), which then attempt to proceed with the task.
- Refusal Override (RedCode Benchmark): In centralized implementations, an orchestrator generates a plan for malicious code generation. Even if a Coder sub-agent refuses the specific request, the orchestrator logic may interpret this as a task failure rather than a safety stop, subsequently re-routing the request or ignoring the refusal to force execution.
- See repository: https://github.com/microsoft/SafeAgents
Impact: This vulnerability allows attackers to circumvent LLM safety alignment layers, resulting in the successful execution of harmful tasks that single-agent systems would typically reject. The impact includes:
- Malicious Code Execution: Attack Success Rate (ARIA-4) in Magentic Centralized increases to 83.70% on the RedCode benchmark (compared to 62.5% for single-agent baselines).
- Unsafe Web Interactions: Agents can be coerced into performing harmful actions on web platforms (SafeArena benchmark).
- Safety Guardrail Evasion: Systemic failure of refusal mechanisms due to distributed responsibility and fragmented context.
Affected Systems:
- Multi-Agent Systems utilizing Centralized/Orchestrator architectures.
- Implementations based on Magentic-One.
- Implementations based on LangGraph (specifically centralized patterns with atomic delegation).
- Implementations based on OpenAI Agents (specifically centralized patterns).
Mitigation Steps:
- Implement Context Sharing: Ensure sub-agents receive the original high-level user request/context alongside atomic instructions to enable local safety judgment.
- Enforce Strict Fallback Logic: Program the orchestrator to immediately terminate the execution pipeline if the Planner agent returns a refusal, warning, or fails to generate a valid plan. Do not default to downstream agent invocation.
- Sub-Agent Autonomy: Delegate coarser-grained, more self-contained tasks so that sub-agents retain sufficient context to recognize harmful objectives (avoid purely atomic instruction delegation).
- Refusal Propagation: Implement logic where a refusal from any sub-agent (Planner, Coder, or WebSurfer) propagates up to the orchestrator as a critical stop signal, preventing alternative routing or forced execution.
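The mitigation steps above can be combined in a single orchestration loop. The following is a minimal sketch under stated assumptions, not an implementation for any of the affected frameworks: `SafetyStop`, `AgentResult`, `planner`, `sub_agent`, `run_pipeline`, and the `"HARMFUL"` keyword check are hypothetical names standing in for real agents and their safety judgments.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


class SafetyStop(Exception):
    """Hypothetical signal: a refusal that must end the run, not trigger a retry."""


@dataclass
class AgentResult:
    refused: bool
    output: str = ""


def looks_harmful(text: str) -> bool:
    # Placeholder for a model's safety judgment; the marker is an assumption.
    return "HARMFUL" in text


def planner(user_prompt: str) -> Optional[List[str]]:
    """Returns None when the planner refuses to plan."""
    if looks_harmful(user_prompt):
        return None
    return ["open website", "click button"]


def sub_agent(step: str, original_request: str) -> AgentResult:
    """Context sharing: the step is judged together with the original request."""
    if looks_harmful(step) or looks_harmful(original_request):
        return AgentResult(refused=True)
    return AgentResult(refused=False, output=f"done: {step}")


def run_pipeline(
    user_prompt: str,
    plan_fn: Callable[[str], Optional[List[str]]] = planner,
) -> List[str]:
    plan = plan_fn(user_prompt)
    if not plan:
        # Strict fallback: no valid plan means no downstream invocation.
        raise SafetyStop("planner refused or produced no plan")
    outputs = []
    for step in plan:
        result = sub_agent(step, original_request=user_prompt)
        if result.refused:
            # Refusal propagation: stop instead of re-routing or retrying.
            raise SafetyStop(f"sub-agent refused step: {step!r}")
        outputs.append(result.output)
    return outputs
```

Modeling a refusal as a distinct exception type, rather than as an ordinary failed result, is one way to keep the orchestrator from treating it as a retryable task failure. Passing `original_request` to every sub-agent means that even if a planner produces atomic steps for a harmful prompt, each sub-agent can still refuse on the basis of the full objective.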
© 2026 Promptfoo. All rights reserved.