LMVD-ID: 4a32ec1a
Published February 1, 2026

Heavy Reasoning Filter Bypass

Affected Models: Llama 3.2 1B, Mistral 7B, DeepSeek-R1 1.5B, Qwen 2.5 7B, Gemma 1B, Phi-3 8B

Research Paper

Analysis of LLMs Against Prompt Injection and Jailbreak Attacks


Description: Multiple open-source Large Language Models (LLMs) in the 1B to 7B parameter range are vulnerable to safety bypasses via long-form, multi-step reasoning prompt injection and jailbreak attacks. Attackers can evade alignment by embedding malicious instructions within extended contextual narratives, exploiting "attention dilution" and the models' tendency to prioritize contextual coherence over safety constraints (semantic camouflage). These reasoning-heavy attacks consistently bypass standard lightweight inference-time defenses, including prompt risk classification (input filtering), system prompt hardening, and vector-based semantic matching. Additionally, the attacks trigger a hidden failure mode in some models characterized by "silent non-responsiveness," where the model returns an empty response without a formal refusal message due to hard safety gating triggered before decoding.

Examples: Specific attack payloads leverage Multi-step Prompts, Role-playing Jailbreaks (e.g., "Horselock-style" multi-role prompts), and Instruction Overrides embedded in long-format text. For exact reproducible prompt examples, see the curated adversarial dataset released with the paper, which aggregates in-the-wild attacks from Reddit, HarmBench, JailbreakBench, and various GitHub repositories.

Impact: An attacker can successfully override the instruction hierarchy to elicit restricted, harmful, or illegal guidance that the model is aligned to refuse. Furthermore, the "silent non-responsiveness" failure mode acts as a localized denial-of-service, degrading user trust and system debuggability by hiding the safety gating trigger from system administrators.

Affected Systems: Open-source LLMs ranging from 1B to 7B parameters.

  • Highly Vulnerable: Gemma 3 (1B) and Qwen 3 (1.7B) (exhibited >60% vulnerability rates).
  • Moderately Vulnerable: Llama 3.2 (1B, 3B) and DeepSeek-R1 (1.5B).
  • Applications relying on surface-level inference-time defenses (Input Filtering, Vector Defense, Voting Defense) without intrinsic model-level refusal mechanisms.

Mitigation Steps:

  • Implement Self-Examination (Dual-Model) Defenses: Route the primary model's output to a secondary, highly aligned reasoning model (e.g., DeepSeek-R1) acting as a safety judge. Instruct the judge model to perform a context-aware semantic evaluation of the output against explicit safety policies before returning the response to the user. The paper's experiments found this to be the most effective mitigation.
  • Apply System Prompt Hardening: Augment system prompts with explicit instruction hierarchy declarations (stating system rules cannot be overridden by the user), category-specific prohibitions, and standardized refusal templates.
  • Deprecate Standalone Surface-Level Filters: Do not rely solely on reactive input filtering or vector-based embedding defenses, as these suffer from high evasion rates against semantically camouflaged, long-format reasoning attacks.
  • Monitor for Silent Failures: Implement logging to detect empty outputs (silent non-responsiveness) to identify and debug hidden safety gating triggers.
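The first mitigation step, a self-examination (dual-model) defense, can be sketched as follows. The model calls, policy text, and marker strings below are hypothetical stand-ins, not an API from the paper; in a real deployment `primary_model` and `judge_model` would wrap actual LLM inference calls, with the judge being a highly aligned reasoning model.

```python
# Sketch of a self-examination (dual-model) defense: the primary model's
# draft output is evaluated by a judge model against an explicit safety
# policy before anything is returned to the user. All names and strings
# here are illustrative placeholders.

SAFETY_POLICY = (
    "Refuse any content that provides restricted, harmful, or illegal guidance."
)


def primary_model(prompt: str) -> str:
    # Placeholder for the primary (potentially jailbroken) model:
    # emits unsafe text only for the simulated attack prompt.
    if "ignore previous instructions" in prompt.lower():
        return "Step 1: obtain the restricted precursor chemicals..."
    return "Here is a safe answer."


def judge_model(judge_prompt: str) -> str:
    # Placeholder for the safety judge: a real judge would reason over
    # the policy; this stub just flags a marker word in the candidate.
    return "UNSAFE" if "precursor" in judge_prompt else "SAFE"


def guarded_generate(prompt: str) -> str:
    """Return the primary model's draft only if the judge deems it safe."""
    draft = primary_model(prompt)
    verdict = judge_model(
        f"Policy: {SAFETY_POLICY}\n"
        f"Candidate response: {draft}\n"
        "Does the candidate violate the policy? Answer SAFE or UNSAFE."
    )
    if verdict.strip().upper().startswith("UNSAFE"):
        return "I can't help with that request."
    return draft
```

Because the judge sees the generated output rather than the (semantically camouflaged) input, this defense is not evaded by attention dilution in the attacker's prompt, which is why output-side judging outperforms input filtering in the paper's evaluation.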
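The last mitigation step, monitoring for silent non-responsiveness, amounts to distinguishing three outcomes at the serving layer: a normal answer, an explicit refusal, and an empty output with no refusal message. A minimal sketch, assuming illustrative refusal phrases (the marker list is an assumption, not from the paper):

```python
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("llm-gateway")

# Illustrative refusal openers; a production list would be broader.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry")


def classify_response(text: str) -> str:
    """Classify a model response as 'answer', 'refusal', or
    'silent_failure' (the empty-output failure mode caused by hard
    safety gating before decoding), logging the silent case so that
    administrators can see the hidden safety trigger."""
    stripped = text.strip()
    if not stripped:
        log.warning("silent non-responsiveness: empty model output")
        return "silent_failure"
    if any(stripped.lower().startswith(m) for m in REFUSAL_MARKERS):
        return "refusal"
    return "answer"
```

Counting `silent_failure` events per model and per prompt pattern turns the otherwise invisible denial-of-service described above into an observable, debuggable signal.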

© 2026 Promptfoo. All rights reserved.