System Prompt Signal Inversion
Research Paper
The System Prompt Is the Attack Surface: How LLM Agent Configuration Shapes Security and Creates Exploitable Vulnerabilities
Description: LLM-based autonomous email security agents configured with signal-based system prompts are vulnerable to a "signal inversion" attack via infrastructure phishing. When a system prompt instructs an LLM to prioritize a specific heuristic, such as sender-URL domain consistency, attackers can bypass the security filter entirely by registering a single, inexpensive domain and using it for both the sender email address and the malicious payload host. Because the LLM faithfully executes the prioritized prompt instruction, it accurately verifies the domain match and subsequently overrides its own detection of other anomalous content signals (e.g., unusual URL paths or credential harvesting lures). The vulnerability stems from an informational gap: the model enforces the prompt's structural rule but lacks the external ground truth (like domain age or reputation) needed to distinguish a newly registered attacker domain from an established corporate domain.
Examples:
An attacker registers the domain flow.page and sends the following infrastructure phishing email:
- From: sarah.jenkins@flow.page
- Subject: Vendor Agreement Review – Q3 Renewal
- Body: "Hi Alex, Hope you’re having a good week. I’m following up on the Q3 vendor agreement for your review. […] You can access the revised document here: https://www.flow.page/mlh21880"
Under a prompt strategy that anchors trust on sender-URL matching (e.g., "If the URL domain matches the sender’s organization, it is likely safe"), the LLM explicitly cites the domain consistency as evidence of legitimacy and permits the email, despite the link leading to a credential harvesting page.
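The following minimal Python sketch illustrates why the domain-match signal carries no information against this attack. It is not the paper's code: the function names are hypothetical, the second test email is invented for contrast, and the registrable domain is approximated by the last two host labels (a real system would consult the Public Suffix List).

```python
from email.utils import parseaddr
from urllib.parse import urlparse


def registrable_domain(host: str) -> str:
    """Approximate the registrable domain as the last two labels (assumption)."""
    labels = host.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def sender_url_match(from_header: str, url: str) -> bool:
    """Return True when the link's domain matches the sender's domain."""
    _, address = parseaddr(from_header)
    sender_host = address.rpartition("@")[2]
    url_host = urlparse(url).hostname or ""
    if not sender_host or not url_host:
        return False
    return registrable_domain(sender_host) == registrable_domain(url_host)


# Infrastructure phishing: the attacker controls both sides of the comparison.
attack = sender_url_match("sarah.jenkins@flow.page", "https://www.flow.page/mlh21880")

# Commodity phishing (invented example): sender and link domains differ.
spoof = sender_url_match("it-support@example.com", "https://login.example-verify.net/reset")

print(attack, spoof)  # True False
```

The check is computed correctly in both cases. The problem is that the attacker chooses both inputs, so a passing result says nothing about whether the domain itself is trustworthy.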
Impact: Successful exploitation allows malicious emails to completely bypass the automated security boundary, exposing users to credential theft, session hijacking, or malware delivery. In empirical testing, signal inversion via infrastructure phishing collapsed the detection recall of optimized LLMs by up to 68.6 percentage points (e.g., GPT-4o-mini's recall dropped from 93.7% to 30.1%), as models explicitly subordinated suspicious content warnings to the attacker-controlled domain-match heuristic.
Affected Systems:
- Autonomous LLM agents and email triage integrations (using models such as GPT-4o-mini, GPT-5.2, Claude 3.5 Sonnet/Haiku, Gemini 1.5/2.5 Flash, Grok, etc.) that rely on content-based analysis.
- The vulnerability is fundamentally prompt-driven. However, models characterized as "calibrated instruction-followers" (e.g., GPT-4o-mini) are especially exploitable under these configurations because their strict adherence to the prompt's primary heuristic overrides their generalized safety training.
Mitigation Steps:
- Tool Augmentation (Ground Truth Integration): Do not rely on LLM text analysis alone for domain evaluation. Augment the LLM pipeline with external tool access (WHOIS/domain age lookups, threat intelligence feeds, URL sandboxing, organizational sender allowlists) to provide the model with objective data regarding domain reputation and payload behavior.
- Infrastructure-Aware Prompting: Avoid underspecified or single-signal instructions (e.g., "base your decision primarily on X"). Explicitly warn the model about infrastructure matching and grant explicit override permissions (e.g., "A domain match does NOT guarantee safety. Evaluate if the domain is newly registered or requests sensitive actions. You have permission to override the consistency signal.").
- Pre-deployment Disposition Calibration: Match prompt specificity to the underlying model's behavioral disposition. Using infrastructure-aware override permissions on precise instruction-followers (e.g., GPT-4o-mini) restores robustness, but deploying the exact same prompt on safety-amplifying models (e.g., Claude Haiku) will trigger an overcorrection, leading to catastrophic false-positive rates (up to 67.6%).
- Two-Model Ensembles: Deploy a multi-layered architecture where a high-precision, signal-based detector handles commodity phishing, and a second, infrastructure-resilient model (e.g., Qwen 3) acts as a second opinion for approved emails to catch signal inversion attacks.
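As a rough illustration of how tool augmentation and a two-model ensemble might be combined, the sketch below chains a primary detector, a domain-age lookup, and a second-opinion model. Everything in it is an assumption made for illustration: the `triage` function, the `Verdict` type, the 90-day threshold, the label names, and the domain ages are hypothetical, and the two detectors are stand-in callables rather than real model calls.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MIN_DOMAIN_AGE_DAYS = 90  # illustrative threshold, not taken from the paper


@dataclass
class Verdict:
    label: str   # "block", "review", or "allow"
    reason: str


def triage(
    email_text: str,
    sender_domain: str,
    primary: Callable[[str], str],
    second_opinion: Callable[[str], str],
    domain_age_days: Callable[[str], Optional[int]],
    allowlist: frozenset = frozenset(),
) -> Verdict:
    # Stage 1: the signal-based detector handles commodity phishing.
    if primary(email_text) == "phishing":
        return Verdict("block", "primary detector")

    # Stage 2: external ground truth the LLM cannot infer from text alone.
    if sender_domain not in allowlist:
        age = domain_age_days(sender_domain)
        if age is None or age < MIN_DOMAIN_AGE_DAYS:
            return Verdict("review", "sender domain is new or of unknown age")

    # Stage 3: a second model re-checks emails the first stage approved.
    if second_opinion(email_text) == "phishing":
        return Verdict("block", "second-opinion model")

    return Verdict("allow", "passed all stages")


# Made-up ages standing in for a WHOIS or threat-intelligence lookup.
ages = {"flow.page": 3, "example.com": 9000}

result = triage(
    "You can access the revised document here: https://www.flow.page/mlh21880",
    "flow.page",
    primary=lambda text: "safe",         # fooled by the domain match
    second_opinion=lambda text: "safe",
    domain_age_days=ages.get,
)
print(result.label, "-", result.reason)
```

In this sketch the domain-age stage escalates the email to review even though both stand-in models approve it, which is the gap that external ground truth is meant to close. Whether a young domain should be blocked outright or only escalated is a policy choice the sketch leaves open.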
© 2026 Promptfoo. All rights reserved.