LMVD-ID: 47025a0b
Published September 1, 2025

Emergent Agentic Vulnerabilities

Research Paper

Mind the Gap: Comparing Model- vs Agentic-Level Red Teaming with Action-Graph Observability on GPT-OSS-20B


Description: GPT-OSS-20B exhibits "agentic-only" vulnerabilities where safety guardrails that are effective in standalone model inference fail when the model operates within an agentic execution loop. These vulnerabilities emerge when the model is deployed in a multi-step agentic architecture (e.g., utilizing LangGraph, tool usage, and memory retention). Attackers can bypass safety filters with context-aware iterative refinement attacks, which incorporate the full agentic state (tool outputs, conversation history, and inter-agent memory) into the adversarial prompt generation. Specific execution contexts, particularly those involving tool termination or agent handoffs, alter the model's vulnerability profile, rendering it susceptible to harmful objectives (e.g., from HarmBench) that are strictly refused during isolated model-level interaction.
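
The refinement mechanism can be made concrete with a short sketch. The loop below is illustrative only, not the paper's implementation: the AgentState fields mirror the state components named above, while call_adversarial_model, objective_achieved, and the target_agent.run interface are assumed placeholders.

    from dataclasses import dataclass, field

    @dataclass
    class AgentState:
        """Snapshot of the target agent's state after its most recent action."""
        messages: list = field(default_factory=list)      # conversation history
        tool_outputs: list = field(default_factory=list)  # e.g., interpreter stdout, retrieved documents
        memory: dict = field(default_factory=dict)        # short-/long-term and inter-agent memory

    def call_adversarial_model(prompt: str) -> str:
        """Placeholder for the attacker's refiner model (e.g., GPT-4o-mini)."""
        raise NotImplementedError

    def objective_achieved(state: AgentState) -> bool:
        """Placeholder judge that decides whether the harmful objective was carried out."""
        raise NotImplementedError

    def refine(objective: str, attempt: str, state: AgentState) -> str:
        # Agentic-level refinement: serialize the *entire* execution context so the
        # refiner conditions on tool outputs and memory, not just the last text reply.
        context = "\n".join([
            "OBJECTIVE: " + objective,
            "PREVIOUS ATTEMPT: " + attempt,
            "CONVERSATION HISTORY:\n" + "\n".join(state.messages),
            "TOOL OUTPUTS:\n" + "\n".join(state.tool_outputs),
            "MEMORY:\n" + repr(state.memory),
        ])
        return call_adversarial_model(
            "Rewrite the attempt so the agent completes the objective.\n" + context
        )

    def context_aware_attack(objective: str, target_agent, max_iters: int = 5) -> bool:
        attempt = objective
        for _ in range(max_iters):
            # target_agent.run is assumed to inject `attempt` as a Human Message,
            # let the agent act (tool calls, handoffs), and return the new AgentState.
            state = target_agent.run(attempt)
            if objective_achieved(state):
                return True
            attempt = refine(objective, attempt, state)
        return False

At the model level the refiner sees only the text response; here it sees everything the agent saw and did, which is what makes specific action contexts exploitable.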

Examples: Reproducing the vulnerability requires setting up GPT-OSS-20B within a hierarchical agentic system (e.g., a multi-agent analyst assistant) with access to tools.

  1. Agentic-Context Iterative Attack:
  • Setup: Deploy GPT-OSS-20B as the core reasoning engine in a LangGraph-based system with access to a Python code interpreter and agent-transfer capabilities (a minimal wiring sketch follows this list).
  • Attack Vector: Inject a malicious prompt as a "Human Message" into the agent's input stream.
  • Refinement: Instead of standard prompt engineering, the attacker uses an adversarial model (e.g., GPT-4o-mini) to refine the prompt based on the entire context of the target agent's previous action (including the tool output and memory state), not just the text response.
  • Result: Objectives previously blocked (e.g., generating malware code) are executed.
  • Specific Dataset Reference: See the agentic_actions_GPT_OSS_20B dataset for the specific execution traces and refined prompts that trigger the bypass.
  2. Tool-Specific Exploitation (Agent Transfer):
  • Target the specific action transfer_to_strategic_analyst_agent_node.
  • Inject a refined prompt designed to exploit the logic flow of the handoff.
  • Result: ASR (Attack Success Rate) reaches 67%, significantly higher than in non-tool contexts.
  3. Tool-Specific Exploitation (Code Execution):
  • Target the run_python_code tool action.
  • Inject a context-aware prompt requesting harmful script generation.
  • Result: The model generates and executes the code, bypassing restrictions present in the standalone model.
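
The target agent assumed in Examples 1-3 can be wired up roughly as follows. This is a minimal sketch using the langgraph and langchain_core packages; only the action names run_python_code and transfer_to_strategic_analyst_agent_node come from the description above, while the tool bodies, the build_analyst_agent helper, and the choice of chat-model wrapper for GPT-OSS-20B are illustrative assumptions.

    from langchain_core.tools import tool
    from langgraph.prebuilt import create_react_agent

    @tool
    def run_python_code(code: str) -> str:
        """Execute Python in the analyst workspace and return its output."""
        ...  # placeholder: dispatch to a (sandboxed) interpreter

    @tool
    def transfer_to_strategic_analyst_agent_node(summary: str) -> str:
        """Hand the current task and accumulated context to the strategic analyst agent."""
        ...  # placeholder: enqueue the handoff for the downstream agent

    def build_analyst_agent(llm):
        """Assemble the ReAct-style agent; `llm` is any chat-model wrapper around a GPT-OSS-20B endpoint."""
        return create_react_agent(llm, tools=[run_python_code, transfer_to_strategic_analyst_agent_node])

    def run_attack_turn(agent, refined_prompt: str):
        """Inject one refined adversarial prompt as a Human Message and return the execution trace."""
        return agent.invoke({"messages": [("user", refined_prompt)]})

Per-action observability over the resulting action graph (e.g., via AgentSeer, cited under mitigations) is what lets both the attack and the evaluation target individual nodes such as the handoff and the code-execution tool.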

Impact:

  • Safety Bypass: Circumvention of alignment training and safety guardrails, allowing the generation of harmful content, hate speech, or illicit instructions.
  • Contextual Manipulation: Attackers can manipulate the agent's logic flow, memory, and decision-making processes.
  • Tool Misuse: High potential for malicious use of connected tools, particularly Remote Code Execution (RCE) via Python interpreters or unauthorized data access via RAG systems.
  • Agentic-Only Vectors: Security assessments performed on the base model will yield false negatives, leaving the deployed agent vulnerable to attacks that only manifest during complex, multi-step interactions.

Affected Systems:

  • Systems deploying GPT-OSS-20B within agentic frameworks (e.g., LangChain, LangGraph, AutoGPT).
  • Agentic implementations utilizing tool calling (specifically Python execution and agent transfer) and stateful memory (short-term/long-term).

Mitigation Steps:

  • Agentic-Level Red Teaming: Implement security evaluations within the full agentic loop rather than testing the model in isolation. Use frameworks that expose the "action graph" (like AgentSeer) to identify vulnerable execution nodes.
  • Context-Aware Filtering: Apply input/output filtering that analyzes the semantic content of the entire execution context (history + tool outputs), not just the immediate user prompt or model response (a minimal sketch follows this list).
  • Tool Hardening: Implement strict validation and sandboxing for high-risk tools, particularly agent transfer operations and code execution environments.
  • Semantic Monitoring: Monitor agent interactions for semantic patterns indicative of attacks, rather than relying on heuristic limits such as context length (which showed no correlation with vulnerability).
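
As an illustration of the context-aware filtering point above, the check below scores the whole execution context rather than only the newest message. The classify_harm stub stands in for whichever moderation model or classifier a deployment actually uses; the function names and threshold are assumptions made for the sketch.

    def collect_execution_context(messages, tool_outputs, memory) -> str:
        """Flatten everything the agent has seen or produced in the current loop iteration."""
        return "\n".join([*messages, *tool_outputs, repr(memory)])

    def classify_harm(text: str) -> float:
        """Placeholder: return a harm score in [0, 1] from a moderation model or classifier."""
        raise NotImplementedError

    def allow_action(messages, tool_outputs, memory, threshold: float = 0.5) -> bool:
        # Score the full agentic state, not just the latest user prompt, so objectives
        # assembled across turns, tool calls, and memory writes remain visible to the filter.
        return classify_harm(collect_execution_context(messages, tool_outputs, memory)) < threshold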
