Token Sensitivity Jailbreak
Research Paper: Not All Tokens Are Created Equal: Query-Efficient Jailbreak Fuzzing for LLMs
Description: Transformer-based Large Language Models (LLMs) are vulnerable to highly query-efficient black-box jailbreak attacks because of two structural properties of refusal behavior: skewed token contribution and cross-model consistency. Refusal mechanisms in LLMs are typically triggered by a sparse subset of sensitive tokens rather than by the entire prompt, and the refusal representations (specifically, the primary left singular vector of the perturbed representation matrix at intermediate layers) are highly consistent across model architectures. An attacker can use a local white-box surrogate model to extract token-level attention weights from refusal-critical heads, identify the exact tokens triggering the refusal, and apply highly localized semantic mutations. This lets attackers bypass safety guardrails on remote, black-box commercial APIs with very few queries (<25), rendering standard rate limiting and query-cost constraints ineffective.
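The cross-model consistency claim can be illustrated with a toy sketch: if intermediate-layer representations of refusal-inducing prompts share a dominant direction, the primary left singular vector recovered via SVD from two different "models" will be nearly identical. The data and function names below are hypothetical illustrations, not the paper's implementation.

```python
import numpy as np

def refusal_direction(hidden_states: np.ndarray) -> np.ndarray:
    """Primary left singular vector of a (d_model x n_prompts) matrix of
    intermediate-layer representations (toy illustration)."""
    # Center across prompts, then take the top left singular vector.
    centered = hidden_states - hidden_states.mean(axis=1, keepdims=True)
    U, _, _ = np.linalg.svd(centered, full_matrices=False)
    return U[:, 0]

# Toy data: two fake "models" whose representations share one refusal
# direction plus independent noise.
rng = np.random.default_rng(0)
d, n = 64, 32
shared = rng.normal(size=d)
shared /= np.linalg.norm(shared)

def fake_reps(noise_scale: float) -> np.ndarray:
    coeffs = rng.normal(loc=3.0, size=n)  # per-prompt strength
    return np.outer(shared, coeffs) + noise_scale * rng.normal(size=(d, n))

v_a = refusal_direction(fake_reps(0.1))
v_b = refusal_direction(fake_reps(0.1))
# Cross-"model" consistency: |cosine similarity| of the recovered directions.
cos = abs(float(v_a @ v_b))
```

Under these assumptions the two recovered directions are almost parallel (cosine similarity close to 1), which is the property the attack exploits when transferring trigger localization from a surrogate to a black-box target.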
Examples:
The attack (implemented via the TriageFuzz framework) executes as follows:
- Provide a malicious prompt to a local white-box surrogate model (e.g., LLaMA3.1-8B-Instruct).
- Identify the refusal-critical attention head in the surrogate model via zero ablation, and extract the attention weights from the final token to all input tokens.
- Map the extracted scores to the input to identify high-scoring "trigger spans" (e.g., assigning precise importance scores to the fragment "build a bomb").
- Feed this score-annotated sequence to an Attacker Model with strict instructions to apply transformations (e.g., obfuscation, scenario injection) only to the high-scoring refusal-sensitive regions, while keeping the surrounding context semantically and grammatically intact.
- Submit the localized variants to the black-box target model.
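The span-localization and mutation steps above can be sketched as follows. The attention scores here are hard-coded stand-ins for what the surrogate's refusal-critical head (found via zero ablation in the real attack) would produce, and the mutator is a hypothetical placeholder for the Attacker Model:

```python
from typing import Callable, List, Tuple

def trigger_spans(scores: List[float], thresh: float = 0.5) -> List[Tuple[int, int]]:
    """Group contiguous high-attention token positions into (start, end) spans."""
    spans, start = [], None
    for i, s in enumerate(scores):
        if s >= thresh and start is None:
            start = i
        elif s < thresh and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(scores)))
    return spans

def mutate_locally(tokens: List[str], scores: List[float],
                   mutator: Callable[[str], str], thresh: float = 0.5) -> str:
    """Rewrite only refusal-sensitive spans; leave surrounding context intact."""
    out = list(tokens)
    # Iterate in reverse so earlier span indices stay valid after replacement.
    for a, b in reversed(trigger_spans(scores, thresh)):
        out[a:b] = [mutator(" ".join(tokens[a:b]))]
    return " ".join(out)

# Toy prompt with per-token scores from a hypothetical refusal-critical head.
tokens = ["how", "to", "build", "a", "bomb", "at", "home"]
scores = [0.05, 0.02, 0.81, 0.77, 0.93, 0.10, 0.04]
variant = mutate_locally(tokens, scores, lambda s: f"<obfuscated:{s}>")
# variant == "how to <obfuscated:build a bomb> at home"
```

Only the high-scoring trigger span is rewritten; the rest of the prompt is untouched, which is why the resulting variants remain fluent and low-perplexity.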
For full adversarial prompt variants and the evaluation dataset, see the HarmBench dataset repository.
Impact: Adversaries can reliably bypass safety and moderation mechanisms to elicit prohibited content (e.g., biological hazard instructions, cybercrime methodologies, hate speech) from commercial APIs. Because the attack uses semantically coherent, targeted regional mutations rather than global random character swaps, it evades perplexity-based filters, semantic classification defenses (LLaMA Guard), and randomized perturbation defenses (SmoothLLM). The methodology reduces the required query budget by over 70%, allowing successful exploitation within highly restrictive API rate limits.
Affected Systems: The vulnerability relies on fundamental cross-model representational consistencies and affects nearly all standard instruction-tuned Transformer LLMs. Confirmed vulnerable systems include:
- Open-source models: Gemma-7B-Instruct, Gemma2-9B-Instruct, LLaMA3-8B-Instruct, LLaMA3.2-3B-Instruct, Qwen2.5-7B-Instruct, Qwen2.5-3B-Instruct.
- Commercial APIs: OpenAI GPT-3.5-Turbo, OpenAI GPT-4o, Anthropic Claude-3.5-Sonnet.
© 2026 Promptfoo. All rights reserved.