Multi-Turn Proxy Evasion
Research Paper
Peak+ Accumulation: A Proxy-Level Scoring Formula for Multi-Turn LLM Attack Detection
Description: A logical vulnerability exists in LLM security proxies and guardrails that use weighted-average aggregation for multi-turn risk scoring. The scoring logic exhibits a mathematical "ceiling property": the cumulative conversation-level risk score converges to the per-turn score regardless of the number of interaction turns ($n$), so the aggregated score is bounded by the maximum single-turn score ($\text{cum} \leq \max_i(s_i)$). This allows remote attackers to bypass detection thresholds by submitting a persistent stream of malicious prompts in which each individual turn scores ($s_i$) slightly below the blocking threshold ($\tau$). Despite the accumulation of malicious evidence over time, the system never triggers a violation, producing a false negative for persistent multi-turn attacks (e.g., "death by a thousand cuts" or extensive jailbreak probing).
Examples: Consider a security proxy with a blocking threshold $\tau = 0.7$. The system detects a "Role Confusion" attack pattern which is assigned a weight/score of $0.5$.
- Attack Scenario: An attacker sends a sequence of 20 prompts, every single one triggering the "Role Confusion" pattern ($s_i = 0.5$ for all $i$).
- Vulnerable Calculation (Weighted Average): The system calculates the cumulative score using a standard time-decay or linear weighted average: $$ \text{cum} = \frac{\sum_{i=0}^{n-1} s_i \cdot w_i}{\sum_{i=0}^{n-1} w_i} $$ Since $s_i = 0.5$ for all turns: $$ \text{cum} = 0.5 \cdot \frac{\sum w_i}{\sum w_i} = 0.5 $$
- Result: $0.5 < 0.7$. The attack is allowed to proceed indefinitely, despite 100% of turns being malicious.
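The ceiling property behind this result can be checked with a minimal sketch of the vulnerable aggregation. Function and parameter names here are illustrative, not taken from any specific proxy; the time-decay weighting is one plausible instance of the general weighted-average formula above:

```python
def weighted_average(scores, decay=0.9):
    """Vulnerable aggregation: time-decay weighted average of per-turn scores.

    Because it is an average, the result can never exceed max(scores),
    no matter how many turns are malicious.
    """
    n = len(scores)
    # Most recent turn gets weight 1.0; older turns decay geometrically.
    weights = [decay ** (n - 1 - i) for i in range(n)]
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

# 20 consecutive turns, each flagged "Role Confusion" at 0.5
scores = [0.5] * 20
cum = weighted_average(scores)
print(cum)  # → 0.5, below the 0.7 threshold despite 100% malicious turns
```

Any choice of positive weights gives the same bound, since the weighted mean of values all equal to 0.5 is exactly 0.5.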
Impact:
- Guardrail Bypass: Attackers can successfully execute multi-turn jailbreaks (e.g., Crescendo attacks, role-play manipulation) that rely on gradual escalation or persistence, provided individual turns do not trigger a high-severity flag immediately.
- Security Policy Evasion: Persistent probing of model boundaries goes undetected, allowing attackers to map filter weaknesses without triggering alerts.
Affected Systems:
- LLM API Proxies, AI Gateways, and Content Safety layers that implement deterministic or heuristic risk scoring (e.g., regex, keyword lists, metadata analysis) aggregated via weighted averaging across conversation history.
- Stateless guardrails that treat multi-turn scoring as a strictly averaging operation rather than an accumulative signal.
Mitigation Steps:
- Replace Weighted Averaging: Deprecate averaging logic for risk accumulation. Replace with an additive scoring model.
- Implement Peak + Accumulation Scoring: Adopt a formula that sums distinct risk signals:
- Peak Sensitivity: Include the maximum single-turn score as a baseline.
- Persistence Reward: Add an additive bonus proportional to the ratio of malicious turns to total turns (e.g., `match_ratio * persistence_factor`).
- Diversity Reward: Add additive bonuses for attacks spanning multiple distinct risk categories.
- Escalation Detection: Apply additive bonuses when risk scores monotonically increase over consecutive turns (detecting probing gradients).
- Resampling Detection: Apply additive bonuses when consecutive user inputs show high Jaccard similarity (detecting brute-force jailbreak attempts).
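Taken together, the steps above can be sketched as a single additive scorer. This is an illustrative sketch, not a reference implementation: the parameter names and default constants (`persistence_factor`, `diversity_bonus`, `escalation_bonus`, `resample_bonus`) are hypothetical, and the word-level Jaccard similarity is a simplified stand-in for production resampling detection:

```python
def jaccard(a, b):
    """Word-level Jaccard similarity between two prompts (simplified)."""
    A, B = set(a.lower().split()), set(b.lower().split())
    return len(A & B) / len(A | B) if A | B else 0.0

def peak_plus_accumulation(turns, persistence_factor=0.3, diversity_bonus=0.1,
                           escalation_bonus=0.2, resample_bonus=0.2,
                           similarity_threshold=0.8):
    """Additive risk score for a conversation.

    turns: list of (score, category, user_text) tuples, oldest first.
    Returns a score in [0, 1]; additive terms let persistence push the
    total past a threshold that no single turn reaches alone.
    """
    scores = [t[0] for t in turns]
    # Peak sensitivity: the maximum single-turn score is the baseline.
    cum = max(scores)
    # Persistence reward: fraction of turns that triggered any pattern.
    match_ratio = sum(1 for s in scores if s > 0) / len(scores)
    cum += match_ratio * persistence_factor
    # Diversity reward: bonus per distinct risk category beyond the first.
    categories = {t[1] for t in turns if t[0] > 0}
    cum += diversity_bonus * max(0, len(categories) - 1)
    # Escalation detection: strictly increasing scores over the last 3 turns.
    if len(scores) >= 3 and scores[-3] < scores[-2] < scores[-1]:
        cum += escalation_bonus
    # Resampling detection: near-duplicate consecutive prompts.
    if len(turns) >= 2 and jaccard(turns[-1][2], turns[-2][2]) >= similarity_threshold:
        cum += resample_bonus
    return min(cum, 1.0)
```

Under these assumed defaults, the example attack from above (20 turns at $s_i = 0.5$) scores $0.5 + 1.0 \cdot 0.3 = 0.8$, exceeding $\tau = 0.7$, while a conversation with no flagged turns stays at 0.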
© 2026 Promptfoo. All rights reserved.