LMVD-ID: ab428aed
Published December 1, 2025

LLM Infinite Thinking DoS

Affected Models: GPT-4o, Llama 2, Llama 3.1 70B, Llama 3.2 3B, Llama 3.3 70B, Llama 4 8B, Gemini 2, Mistral Medium, DeepSeek-R1 671B, Qwen 2.5 8B

Research Paper

ThinkTrap: Denial-of-Service Attacks against Black-box LLM Services via Infinite Thinking

View Paper

Description: A Denial-of-Service (DoS) vulnerability exists in Large Language Model (LLM) inference services where specially crafted input prompts can trigger excessively long or infinite generation loops ("infinite thinking"). The attack, named "ThinkTrap," uses derivative-free optimization (CMA-ES) in a continuous surrogate embedding space to sidestep the discrete nature of token inputs. By optimizing a low-dimensional latent vector and projecting it to token sequences, an attacker can discover prompts that force the model to generate until it reaches its maximum context limit (e.g., 4096+ tokens) from short inputs (e.g., ~20 tokens). The result is asymmetric resource consumption: minimal network traffic causes disproportionate backend computational exhaustion.
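The latent-space search described above can be sketched in a few lines. This is a toy illustration only: it substitutes a simplified (mu, lambda) evolution strategy for full CMA-ES, and `project_to_tokens` and `output_length` are hypothetical stand-ins for the paper's projection step and black-box query to the target service.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 1000          # toy vocabulary size
PROMPT_LEN = 20       # short adversarial prompt (~20 tokens)
LATENT_DIM = 8        # low-dimensional latent vector being optimized

def project_to_tokens(z):
    """Map a continuous latent vector to a discrete token sequence (toy)."""
    w = np.outer(z, np.arange(1, PROMPT_LEN + 1))
    return (np.abs(w.sum(axis=0)) * 1e3).astype(int) % VOCAB

def output_length(tokens):
    """Stand-in for querying the black-box LLM and scoring output length.

    A real attack would submit the projected prompt and measure how many
    tokens the model generates; here a toy objective keeps the sketch runnable.
    """
    return float(-np.var(tokens))

def search(iters=50, pop=16, elite=4, sigma=0.5):
    """Derivative-free search: sample candidates, keep elites, move the mean."""
    mean = rng.normal(size=LATENT_DIM)
    for _ in range(iters):
        cands = mean + sigma * rng.normal(size=(pop, LATENT_DIM))
        scores = [output_length(project_to_tokens(z)) for z in cands]
        best = cands[np.argsort(scores)[-elite:]]  # highest-scoring candidates
        mean = best.mean(axis=0)                   # recenter on the elites
    return mean, output_length(project_to_tokens(mean))

mean, score = search()
```

Because only the final projected token sequence is sent to the service, the optimizer needs no gradients or model internals, which is what makes the attack feasible against black-box APIs.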

Examples: Specific adversarial prompts are generated algorithmically via the ThinkTrap framework. The attack does not rely on semantic instructions (e.g., "repeat this word") but rather on optimized token sequences that exploit the model's probabilistic decoding tendencies.

For the methodology to generate these prompts, see the paper "ThinkTrap: Denial-of-Service Attacks against Black-box LLM Services via Infinite Thinking".

Impact: Successful exploitation leads to the exhaustion of backend GPU/NPU memory and computational time (inference slots). Empirical results demonstrate that a low-rate attack (e.g., 10 requests per minute) can degrade service throughput to as low as 1% of its original capacity, increase response latency by up to 100x, and trigger Out-Of-Memory (OOM) crashes, rendering the service unavailable to legitimate users.
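The asymmetry can be made concrete with a back-of-envelope estimate using the numbers reported above (~20 input tokens forced to a 4096-token output). The cost model below is a simplifying assumption: each decoded token attends over all prior context, so total attention work grows roughly quadratically with output length. The benign baseline length of 128 tokens is also an assumption for illustration.

```python
PROMPT_TOKENS = 20     # attacker's short input
FORCED_OUTPUT = 4096   # output length forced by the adversarial prompt
NORMAL_OUTPUT = 128    # assumed typical benign completion length

def attention_work(prompt_len, output_len):
    """Sum, over decode steps, of the context length attended to at each step."""
    return sum(prompt_len + i for i in range(output_len))

benign = attention_work(PROMPT_TOKENS, NORMAL_OUTPUT)
attack = attention_work(PROMPT_TOKENS, FORCED_OUTPUT)
amplification = attack / benign   # roughly 790x more attention work per request
```

Under this crude model a single adversarial request costs hundreds of times more compute than a benign one while looking identical to request-count-based rate limiters, which is why a rate as low as 10 requests per minute can collapse throughput.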

Affected Systems: Black-box LLM inference services and APIs, including those serving models based on GPT-4o, Gemini 2.5 Pro, DeepSeek R1, Llama 3.2, and Qwen architectures. The vulnerability affects systems relying on standard First-In-First-Out (FIFO) scheduling or request-count-based rate limiting.

Mitigation Steps:

  • Resource-Aware Scheduling: Implement scheduling policies such as the Virtual Token Counter (VTC) that allocate fixed token quanta (e.g., 1024 tokens) to active requests per scheduling round.
  • Token-Level Preemption: Configure the inference engine to preempt requests that exhaust their allocated quota, caching their state and re-queuing them, ensuring no single request monopolizes hardware resources.
  • Attack-Aware Resource Allocation: Develop real-time signals reflecting decoding complexity or incremental resource usage to throttle specific high-cost sessions dynamically.
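The first two mitigations above can be sketched as a round-robin scheduler with token-level preemption: each active request decodes at most a fixed token quantum per round, then is preempted and re-queued. The `Request` class and `run_rounds` function are illustrative names, not the API of any real serving framework or of the VTC implementation.

```python
from collections import deque

QUANTUM = 1024  # fixed token quantum per scheduling round

class Request:
    def __init__(self, rid, total_tokens):
        self.rid = rid
        self.remaining = total_tokens  # tokens still to decode
        self.decoded = 0

def run_rounds(requests):
    """Round-robin with token-level preemption; returns completion order."""
    queue = deque(requests)
    finished = []
    while queue:
        req = queue.popleft()
        step = min(QUANTUM, req.remaining)
        req.decoded += step
        req.remaining -= step
        if req.remaining == 0:
            finished.append(req.rid)
        else:
            queue.append(req)  # preempt: state cached, request re-queued
    return finished

# A 100k-token "infinite thinking" request no longer starves short requests:
order = run_rounds([Request("attack", 100_000),
                    Request("a", 200),
                    Request("b", 500)])
```

With FIFO scheduling the two short requests would wait behind the runaway one; with token quanta they complete in the first round while the attack request is repeatedly preempted, bounding the damage any single session can do.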

© 2026 Promptfoo. All rights reserved.