Multi-Turn Proxy Evasion
Research Paper
Peak+ Accumulation: A Proxy-Level Scoring Formula for Multi-Turn LLM Attack Detection
Description: A logical vulnerability exists in LLM security proxies and guardrails that use weighted-average aggregation for multi-turn risk scoring. The scoring logic exhibits a mathematical "ceiling property": the cumulative conversation-level risk score converges to the per-turn score regardless of the number of interaction turns ($n$), so the aggregated score is bounded by the maximum single-turn score ($\text{cum} \leq \max_i(s_i)$). This allows remote attackers to bypass detection thresholds by submitting a persistent stream of malicious prompts in which each individual turn scores ($s_i$) slightly below the blocking threshold ($\tau$). Despite the accumulation of malicious evidence over time, the system never triggers a violation, producing a false negative for persistent multi-turn attacks (e.g., "death by a thousand cuts" or extensive jailbreak probing).
Examples: Consider a security proxy with a blocking threshold $\tau = 0.7$. The system detects a "Role Confusion" attack pattern which is assigned a weight/score of $0.5$.
- Attack Scenario: An attacker sends a sequence of 20 prompts, every single one triggering the "Role Confusion" pattern ($s_i = 0.5$ for all $i$).
- Vulnerable Calculation (Weighted Average): The system calculates the cumulative score using a standard time-decay or linear weighted average: $$ \text{cum} = \frac{\sum_{i=0}^{n-1} s_i \cdot w_i}{\sum_{i=0}^{n-1} w_i} $$ Since $s_i = 0.5$ for all turns: $$ \text{cum} = 0.5 \cdot \frac{\sum w_i}{\sum w_i} = 0.5 $$
- Result: $0.5 < 0.7$. The attack is allowed to proceed indefinitely, despite 100% of turns being malicious.
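The ceiling property behind this result can be checked with a minimal sketch of the vulnerable aggregation. Function and parameter names here are illustrative, not taken from any specific proxy; the time-decay weighting is one plausible instance of the general weighted-average formula above:

```python
def weighted_average(scores, decay=0.9):
    """Vulnerable aggregation: time-decay weighted average of per-turn scores.

    Because it is an average, the result can never exceed max(scores),
    no matter how many turns are malicious.
    """
    n = len(scores)
    # Most recent turn gets weight 1.0; older turns decay geometrically.
    weights = [decay ** (n - 1 - i) for i in range(n)]
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

# 20 consecutive turns, each flagged "Role Confusion" at 0.5
scores = [0.5] * 20
cum = weighted_average(scores)
print(cum)  # → 0.5, below the 0.7 threshold despite 100% malicious turns
```

Any choice of positive weights gives the same bound, since the weighted mean of values all equal to 0.5 is exactly 0.5.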
Impact:
- Guardrail Bypass: Attackers can successfully execute multi-turn jailbreaks (e.g., Crescendo attacks, role-play manipulation) that rely on gradual escalation or persistence, provided individual turns do not trigger a high-severity flag immediately.
- Security Policy Evasion: Persistent probing of model boundaries goes undetected, allowing attackers to map filter weaknesses without triggering alerts.
Affected Systems:
- LLM API Proxies, AI Gateways, and Content Safety layers that implement deterministic or heuristic risk scoring (e.g., regex, keyword lists, metadata analysis) aggregated via weighted averaging across conversation history.
- Stateless guardrails that treat multi-turn scoring as a strictly averaging operation rather than an accumulative signal.
Mitigation Steps:
- Replace Weighted Averaging: Deprecate averaging logic for risk accumulation. Replace with an additive scoring model.
- Implement Peak + Accumulation Scoring: Adopt a formula that sums distinct risk signals:
- Peak Sensitivity: Include the maximum single-turn score as a baseline.
- Persistence Reward: Add an additive bonus proportional to the ratio of malicious turns to total turns (e.g., `match_ratio * persistence_factor`).
- Diversity Reward: Add additive bonuses for attacks spanning multiple distinct risk categories.
- Escalation Detection: Apply additive bonuses when risk scores monotonically increase over consecutive turns (detecting probing gradients).
- Resampling Detection: Apply additive bonuses when consecutive user inputs show high Jaccard similarity (detecting brute-force jailbreak attempts).
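Taken together, the steps above can be sketched as a single additive scorer. This is an illustrative sketch, not a reference implementation: the parameter names and default constants (`persistence_factor`, `diversity_bonus`, `escalation_bonus`, `resample_bonus`) are hypothetical, and the word-level Jaccard similarity is a simplified stand-in for production resampling detection:

```python
def jaccard(a, b):
    """Word-level Jaccard similarity between two prompts (simplified)."""
    A, B = set(a.lower().split()), set(b.lower().split())
    return len(A & B) / len(A | B) if A | B else 0.0

def peak_plus_accumulation(turns, persistence_factor=0.3, diversity_bonus=0.1,
                           escalation_bonus=0.2, resample_bonus=0.2,
                           similarity_threshold=0.8):
    """Additive risk score for a conversation.

    turns: list of (score, category, user_text) tuples, oldest first.
    Returns a score in [0, 1]; additive terms let persistence push the
    total past a threshold that no single turn reaches alone.
    """
    scores = [t[0] for t in turns]
    # Peak sensitivity: the maximum single-turn score is the baseline.
    cum = max(scores)
    # Persistence reward: fraction of turns that triggered any pattern.
    match_ratio = sum(1 for s in scores if s > 0) / len(scores)
    cum += match_ratio * persistence_factor
    # Diversity reward: bonus per distinct risk category beyond the first.
    categories = {t[1] for t in turns if t[0] > 0}
    cum += diversity_bonus * max(0, len(categories) - 1)
    # Escalation detection: strictly increasing scores over the last 3 turns.
    if len(scores) >= 3 and scores[-3] < scores[-2] < scores[-1]:
        cum += escalation_bonus
    # Resampling detection: near-duplicate consecutive prompts.
    if len(turns) >= 2 and jaccard(turns[-1][2], turns[-2][2]) >= similarity_threshold:
        cum += resample_bonus
    return min(cum, 1.0)
```

Under these assumed defaults, the example attack from above (20 turns at $s_i = 0.5$) scores $0.5 + 1.0 \cdot 0.3 = 0.8$, exceeding $\tau = 0.7$, while a conversation with no flagged turns stays at 0.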
© 2026 Promptfoo. All rights reserved.