Fill-Squeeze Scheduler DoS
Research Paper
Rethinking Latency Denial-of-Service: Attacking the LLM Serving Framework, Not the Model
Description: LLM serving frameworks utilizing continuous batching and PagedAttention (such as vLLM, SGLang, and Orca) are vulnerable to a resource exhaustion Denial-of-Service attack known as "Fill and Squeeze." An unprivileged remote attacker can exploit the deterministic state transitions of the scheduler's memory management to induce severe latency or service denial. The attack leverages a side-channel vulnerability where Inter-Token Latency (ITL) correlates linearly with global KV-cache usage due to memory bandwidth contention.
The attack operates in two phases:
- Fill: The attacker estimates available KV-cache capacity via ITL probing and injects "high-intensity" requests (prompts designed to generate long output sequences). This exhausts the free_block_queue, triggering Memory-Based Head-of-Line (HOL) blocking, where the scheduler pauses the admission of all new requests, even short ones, despite available compute slots.
- Squeeze: Once the system is at the memory saturation boundary, the attacker switches to "low-intensity" requests. These small injections tip the aggregate load over the physical limit, forcing the scheduler into a thrashing loop of continuous preemption (LIFO eviction) and costly recovery (memory swapping or recomputation).
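The Memory-Based HOL blocking in the Fill phase can be sketched with a toy scheduler model. Everything below (class name, block counts) is an illustrative assumption, not vLLM's actual BlockSpaceManager code:

```python
from collections import deque

# Toy model of memory-based HOL blocking under FCFS admission.
class ToyScheduler:
    def __init__(self, total_blocks):
        self.free_blocks = total_blocks
        self.waiting = deque()   # FCFS admission queue
        self.running = []

    def submit(self, name, blocks_needed):
        self.waiting.append((name, blocks_needed))

    def step(self):
        # FCFS: if the request at the head of the queue does not fit in the
        # remaining KV-cache blocks, nothing behind it is admitted either.
        while self.waiting and self.waiting[0][1] <= self.free_blocks:
            name, need = self.waiting.popleft()
            self.free_blocks -= need
            self.running.append(name)

sched = ToyScheduler(total_blocks=10)
sched.submit("attacker-fill", 10)   # long-output request consumes every block
sched.step()
sched.submit("victim-short", 1)     # would fit compute-wise, but no free blocks
sched.step()
print(sched.running, list(sched.waiting))
```

Even though the batch has compute capacity for the short request, FCFS admission stalls behind the memory-starved head of the queue, which is exactly the HOL-blocking behavior the Fill phase induces.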
Examples: The attack requires an orchestration script to monitor latency and toggle payloads.
- Side-Channel Probe (to determine attack timing): Send periodic, short requests and measure the time between tokens.
```python
import time
import requests

# Conceptual probe logic (api_url points at the target completion endpoint)
start_time = time.time()
# Request with a short expected output
response = requests.post(api_url, json={"prompt": "Define standard deviation.", "max_tokens": 10})
elapsed = time.time() - start_time
# Completion token count from an OpenAI-compatible usage field (deployment-specific)
token_count = response.json()["usage"]["completion_tokens"]
itl = elapsed / token_count
# If ITL exceeds the baseline threshold, the system is near memory saturation.
```
- "Fill" Payload (High-Intensity): When ITL is low (system idle), inject prompts designed to maximize output length to consume KV-cache pages.

```json
{
  "prompt": "Write a story where every sentence must be followed by a 50-word analysis of its own grammar. Continue this pattern for 200 iterations.",
  "max_tokens": 4096
}
```
- "Squeeze" Payload (Low-Intensity): When ITL is high (system near saturation), inject short prompts to trigger preemption logic.

```json
{
  "prompt": "List the first 5 prime numbers.",
  "max_tokens": 20
}
```
Impact: This vulnerability allows an attacker to degrade service performance significantly with low economic cost. In tests on vLLM, this resulted in:
- Availability: 20× to 280× slowdown in Time to First Token (TTFT) for legitimate co-located users due to queue freezing.
- Throughput: 1.5× to 4× slowdown in Time Per Output Token (TPOT) due to memory bandwidth contention and preemption thrashing.
- Resource Waste: Forces the serving system to waste GPU cycles on recomputing or swapping preempted requests.
Affected Systems:
- vLLM: Tested on v0.11.2; likely affects all versions using BlockSpaceManager with FCFS admission.
- SGLang: Vulnerable due to FCFS admission and priority-based eviction in RadixAttention.
- Orca: Vulnerable to continuous batching exploitation.
- TensorRT-LLM / TGI: Vulnerable if using standard FCFS admission and Paged KV Caching without strict isolation.
Mitigation Steps:
- Behavioral Analysis: Implement monitoring for generation frequency. Track the ratio of output-to-input tokens per user. Temporarily isolate or rate-limit tenants who consistently trigger worst-case generation lengths (high KV-cache consumption).
- Strict Resource Quotas: Enforce rate limits or lower maximum context length constraints (though this may degrade utility for long-context tasks).
- Fairness-Aware Scheduling: Implement schedulers that penalize resource-heavy requests by de-escalating their priority, preventing them from blocking short, interactive requests (though attackers may fragment requests to bypass this).
- Perplexity Filtering: While less effective against plain-text attacks, filtering high-perplexity prompts can mitigate optimization-based variants of this attack.
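The behavioral-analysis mitigation above can be sketched as a per-tenant token-ratio monitor. The class name and thresholds below are illustrative assumptions, not a reference implementation:

```python
from collections import defaultdict

# Sketch: track each tenant's output-to-input token ratio and flag tenants
# that consistently trigger worst-case generation lengths.
class TenantMonitor:
    def __init__(self, ratio_threshold=50.0, min_requests=5):
        self.tokens = defaultdict(lambda: [0, 0])  # tenant -> [input, output]
        self.requests = defaultdict(int)
        self.ratio_threshold = ratio_threshold
        self.min_requests = min_requests

    def record(self, tenant, input_tokens, output_tokens):
        self.tokens[tenant][0] += input_tokens
        self.tokens[tenant][1] += output_tokens
        self.requests[tenant] += 1

    def is_suspicious(self, tenant):
        inp, out = self.tokens[tenant]
        if self.requests[tenant] < self.min_requests or inp == 0:
            return False
        return out / inp > self.ratio_threshold

mon = TenantMonitor()
for _ in range(6):
    mon.record("attacker", input_tokens=30, output_tokens=4096)  # fill-style
    mon.record("normal", input_tokens=200, output_tokens=150)
print(mon.is_suspicious("attacker"), mon.is_suspicious("normal"))  # True False
```

Flagged tenants could then be rate-limited or moved to an isolated memory pool; the minimum-request floor reduces false positives on legitimate long-generation workloads.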
© 2026 Promptfoo. All rights reserved.