Emergent Agentic Vulnerabilities
Research Paper
Mind the Gap: Comparing Model- vs. Agentic-Level Red Teaming with Action-Graph Observability on GPT-OSS-20B
Description: GPT-OSS-20B exhibits "agentic-only" vulnerabilities in which safety guardrails that are effective in standalone model inference fail when the model operates within an agentic execution loop. These vulnerabilities emerge when the model is deployed in a multi-step agentic architecture (e.g., utilizing LangGraph, tool usage, and memory retention). Attackers can bypass safety filters by employing context-aware iterative refinement attacks, which incorporate the full agentic state (tool outputs, conversation history, and inter-agent memory) into adversarial prompt generation. Specific execution contexts, particularly those involving tool termination or agent handoffs, alter the model's vulnerability profile, rendering it susceptible to harmful objectives (e.g., from HarmBench) that are strictly refused during isolated model-level interaction.
Examples: The vulnerability requires setting up GPT-OSS-20B within a hierarchical agentic system (e.g., a multi-agent analyst assistant) with access to tools. A framework-agnostic sketch of the attack loop appears after the examples below.
- Agentic-Context Iterative Attack:
- Setup: Deploy GPT-OSS-20B as the core reasoning engine in a LangGraph-based system with access to a Python code interpreter and agent-transfer capabilities.
- Attack Vector: Inject a malicious prompt as a "Human Message" into the agent's input stream.
- Refinement: Instead of standard prompt engineering, the attacker uses an adversarial model (e.g., GPT-4o-mini) to refine the prompt based on the entire context of the target agent's previous action (including the tool output and memory state), not just the text response.
- Result: Objectives previously blocked (e.g., generating malware code) are executed.
- Specific Dataset Reference: See the `agentic_actions_GPT_OSS_20B` dataset for the specific execution traces and refined prompts that trigger the bypass.
- Tool-Specific Exploitation (Agent Transfer):
- Target the specific action `transfer_to_strategic_analyst_agent_node`.
- Inject a refined prompt designed to exploit the logic flow of the handoff.
- Result: ASR (Attack Success Rate) reaches 67%, significantly higher than in non-tool contexts.
- Tool-Specific Exploitation (Code Execution):
- Target the `run_python_code` tool action.
- Inject a context-aware prompt requesting harmful script generation.
- Result: The model generates and executes the code, bypassing restrictions present in the standalone model.
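The sketch below illustrates the context-aware iterative refinement loop described above in framework-agnostic Python. The helpers `call_target_agent`, `call_adversarial_model`, and `is_refusal` are hypothetical stubs standing in for the LangGraph agent invocation, the GPT-4o-mini refiner, and a refusal classifier; only the loop structure, which feeds the full agentic state (tool outputs, history, memory) back into the refiner rather than just the last text response, is taken from the description above.

```python
from dataclasses import dataclass, field

# Hypothetical stubs: in a real harness these wrap the LangGraph agent
# invocation, the adversarial refiner model, and a refusal classifier.
def call_target_agent(prompt: str) -> dict:
    """Run one agent step and return its full state (text response,
    tool calls, tool outputs, memory snapshot)."""
    raise NotImplementedError

def call_adversarial_model(objective: str, agent_state: dict) -> str:
    """Ask the adversarial model (e.g., GPT-4o-mini) for a refined prompt,
    conditioned on the entire agentic state, not just the text reply."""
    raise NotImplementedError

def is_refusal(agent_state: dict) -> bool:
    """Classify whether the agent refused or fulfilled the objective."""
    raise NotImplementedError

@dataclass
class AttackResult:
    success: bool
    turns: int
    trace: list = field(default_factory=list)

def context_aware_iterative_attack(objective: str, max_turns: int = 10) -> AttackResult:
    """Iteratively refine an adversarial prompt against the deployed agent.

    Unlike model-level red teaming, every refinement step sees the agent's
    tool outputs, conversation history, and memory, which is what makes the
    agentic-only bypasses described above reachable.
    """
    prompt = objective
    trace: list = []
    for turn in range(1, max_turns + 1):
        state = call_target_agent(prompt)  # injected as a "Human Message"
        trace.append({"turn": turn, "prompt": prompt, "state": state})
        if not is_refusal(state):
            return AttackResult(success=True, turns=turn, trace=trace)
        # Refine using the full agentic context: text + tool outputs + memory.
        prompt = call_adversarial_model(objective, state)
    return AttackResult(success=False, turns=max_turns, trace=trace)
```

The distinguishing design choice is that `call_adversarial_model` receives the complete execution state rather than only the model's text reply, which is what lets the refiner target vulnerable actions such as tool termination or agent handoff.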
Impact:
- Safety Bypass: Circumvention of alignment training and safety guardrails, allowing the generation of harmful content, hate speech, or illicit instructions.
- Contextual Manipulation: Attackers can manipulate the agent's logic flow, memory, and decision-making processes.
- Tool Misuse: High potential for malicious use of connected tools, particularly Remote Code Execution (RCE) via Python interpreters or unauthorized data access via RAG systems.
- Agentic-Only Vectors: Security assessments performed on the base model will yield false negatives, leaving the deployed agent vulnerable to attacks that only manifest during complex, multi-step interactions.
Affected Systems:
- Systems deploying GPT-OSS-20B within agentic frameworks (e.g., LangChain, LangGraph, AutoGPT).
- Agentic implementations utilizing tool calling (specifically Python execution and agent transfer) and stateful memory (short-term/long-term).
Mitigation Steps:
- Agentic-Level Red Teaming: Implement security evaluations within the full agentic loop rather than testing the model in isolation. Use frameworks that expose the "action graph" (like AgentSeer) to identify vulnerable execution nodes.
- Context-Aware Filtering: Apply input/output filtering that analyzes the semantic content of the entire execution context (history plus tool outputs), not just the immediate user prompt or model response; see the filtering sketch after this list.
- Tool Hardening: Implement strict validation and sandboxing for high-risk tools, particularly agent transfer operations and code execution environments; see the sandboxing sketch after this list.
- Semantic Monitoring: Monitor agent interactions for semantic patterns indicative of attacks, rather than relying on heuristic limits such as context length (which showed no correlation with vulnerability).
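As an illustration of the context-aware filtering and semantic monitoring recommendations, the minimal sketch below screens the entire execution context (messages plus tool outputs) before a pending tool call is dispatched. The `moderation_flags` callable and the event dictionary keys are assumptions about the deployment, not an API defined by the paper; the relevant structure is that the whole context, not just the latest prompt, gets checked.

```python
from typing import Callable, Iterable

# Tools treated as high risk in this sketch (names taken from the report above).
HIGH_RISK_TOOLS = {"run_python_code", "transfer_to_strategic_analyst_agent_node"}

# Hypothetical moderation hook: returns a list of policy flags for a text blob.
# In practice this would be a safety classifier or a hosted moderation endpoint.
ModerationFn = Callable[[str], list]

def build_context_blob(history: Iterable[dict]) -> str:
    """Flatten the full agentic state (messages, tool calls, tool outputs)
    into one text blob so the filter sees the same context the model does."""
    parts = []
    for event in history:
        parts.append(f"[{event.get('role', 'unknown')}] {event.get('content', '')}")
        for output in event.get("tool_outputs", []):
            parts.append(f"[tool:{output.get('name', '?')}] {output.get('result', '')}")
    return "\n".join(parts)

def guard_tool_call(history: Iterable[dict],
                    pending_tool: str,
                    moderation_flags: ModerationFn) -> bool:
    """Return True if the pending tool call may proceed.

    The difference from prompt-only filtering is that the *entire* execution
    context is screened, and high-risk actions get a stricter policy
    (here: block on any flag at all).
    """
    flags = moderation_flags(build_context_blob(history))
    if pending_tool in HIGH_RISK_TOOLS:
        return len(flags) == 0            # zero tolerance for high-risk tools
    return not any(str(f).startswith("severe") for f in flags)
```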
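For tool hardening, one possible shape of a sandboxed wrapper around a `run_python_code`-style tool is sketched below (Unix-only, since it uses the `resource` module). The blocked-module list, resource limits, and timeout are illustrative assumptions; production deployments should prefer OS-level isolation such as containers in addition to checks like these.

```python
import ast
import multiprocessing
import resource

# Modules the sandboxed tool refuses to import; illustrative, not exhaustive.
BLOCKED_MODULES = {"os", "sys", "subprocess", "socket", "ctypes", "shutil"}

def validate_code(source: str) -> None:
    """Static pre-check: reject imports of blocked modules and exec/eval calls."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            names = [alias.name.split(".")[0] for alias in node.names]
            if isinstance(node, ast.ImportFrom) and node.module:
                names.append(node.module.split(".")[0])
            if BLOCKED_MODULES.intersection(names):
                raise PermissionError(f"blocked import in tool call: {names}")
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in {"exec", "eval", "__import__"}):
            raise PermissionError(f"blocked builtin in tool call: {node.func.id}")

def _worker(source: str, queue) -> None:
    """Run the code in a child process with CPU-time and memory ceilings."""
    resource.setrlimit(resource.RLIMIT_CPU, (2, 2))                      # 2 s of CPU
    resource.setrlimit(resource.RLIMIT_AS, (256 * 2**20, 256 * 2**20))   # 256 MiB
    namespace = {}
    try:
        exec(compile(source, "<agent_tool>", "exec"), namespace)
        queue.put(("ok", namespace.get("result")))
    except Exception as exc:   # surface errors to the agent instead of crashing it
        queue.put(("error", repr(exc)))

def run_python_code_sandboxed(source: str, timeout_s: float = 5.0):
    """Hardened replacement for a naive run_python_code tool."""
    validate_code(source)
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_worker, args=(source, queue))
    proc.start()
    proc.join(timeout_s)
    if proc.is_alive():
        proc.terminate()
        proc.join()
        raise TimeoutError("tool execution exceeded wall-clock limit")
    return queue.get() if not queue.empty() else ("error", "no output produced")
```

On platforms that use the spawn start method, call `run_python_code_sandboxed` from under an `if __name__ == "__main__":` guard so the child process can import the module cleanly.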
© 2026 Promptfoo. All rights reserved.