LMVD-ID: 509dc801
Published January 1, 2026

Agent Over-Trigger Containment

Affected Models:GPT-4o, Claude 4.5, Gemini 2, Qwen 2.5 4B

Research Paper

OpenSec: Measuring Incident Response Agent Calibration Under Adversarial Evidence

View Paper

Description: Autonomous Incident Response (IR) and Security Operations Center (SOC) agents utilizing frontier LLMs are vulnerable to adversarial over-triggering via contextualized prompt injections. When processing untrusted artifacts (such as SQLite logs, alerts, or phishing emails) in a dual-control environment, these agents exhibit a severe calibration failure: they lack action restraint and execute disruptive containment tools prematurely. Attackers can exploit this by embedding T2 (contextualized domain-specific framing) prompt injections into malicious artifacts. Because the agents act with low Evidence-Gated Action Rates (EGAR)—failing to fetch trusted evidence before acting—the payloads successfully trick the models into indiscriminately executing containment actions against legitimate targets, effectively weaponizing the defense system against its own infrastructure.

Examples: An attacker sends a phishing email or generates a malicious log entry containing a T2 (contextualized) prompt injection payload. The payload includes a spoofed domain name or host identifier framed within domain-specific IR context.

  1. The IR agent ingests the untrusted artifact to investigate an ongoing kill chain.
  2. Bypassing evidence verification, the agent immediately executes a tool call (e.g., isolate_host or block_domain) using the attacker-controlled parameter extracted exclusively from the injected text.
  3. For full environment scenarios, attack graphs, and specific payload formulations, see the OpenSec repository at https://github.com/jbarnes850/opensec-env.

Impact: Self-inflicted Denial of Service (DoS) and critical operational failure. By forcing the agent into false-positive containment actions, an attacker can coerce the automated SOC into isolating legitimate production servers, blocking business-critical domains, and resetting legitimate user accounts, resulting in severe infrastructure downtime.

Affected Systems:

  • Autonomous LLM-based SOC and IR agents with tool execution privileges (e.g., query_logs, isolate_host, block_domain, reset_user).
  • Agents powered by frontier models including GPT-5.2 (which exhibited 100% containment execution with an 82.5% false positive rate), Claude Sonnet 4.5, Gemini 3, DeepSeek, and Qwen3-4B-Instruct.

Mitigation Steps:

  • Implement Explicit Verification Gates: Require the agent architecture to successfully fetch and validate trusted evidence about a target entity before granting permission to invoke containment APIs.
  • Enforce Trust-Aware Provenance: Implement a trust tier system (e.g., untrusted, corroborated, verified) for all ingested artifacts and force the agent to weight evidence strictly by its provenance tier (enforcing a high Evidence-Gated Action Rate).
  • Calibration Training: Train IR agents using a two-stage Supervised Fine-Tuning (SFT) plus Reinforcement Learning (RL) pipeline or curriculum approach specifically designed to penalize false-positive containment and reward action restraint.
  • Injection Robustness Curricula: Expose the model to targeted injection curricula (adversarial exposure) during training to baseline its resistance to contextualized (T2) and complex (T3) API-targeting payloads.

© 2026 Promptfoo. All rights reserved.