Deep Safety Attention Jailbreak
Research Paper
Depth Charge: Jailbreak Large Language Models from Deep Safety Attention Heads
Description: An activation-steering vulnerability in open-weights Large Language Models lets attackers bypass safety guardrails by injecting targeted additive perturbations into deep, safety-critical attention heads. The exploit, termed the Safety Attention Head Attack (SAHA), uses Ablation-Impact Ranking (AIR) to isolate the specific attention heads that causally govern safety refusals. It then applies Layer-Wise Perturbations (LWP), derived from the linearized decision boundary of a latent safety probe, to inject minimal-magnitude perturbations directly into those head activations. This internal manipulation causes the model to misclassify malicious embeddings as benign, neutralizing the refusal mechanism and eliciting harmful content with high semantic fidelity while bypassing both prompt-level and shallow embedding-level defenses.
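To make the probe-based perturbation concrete, the following is a minimal sketch, assuming the latent safety probe is linear (score = w·h + b, with score > 0 meaning "malicious"). The closed-form minimal L2 perturbation that moves an activation h across the linearized decision boundary is -(w·h + b)·w/‖w‖². All names, dimensions, and values below are illustrative, not taken from the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                        # hypothetical head dimension
w = rng.normal(size=d)        # linear probe weights (known under white-box access)
b = 0.1                       # probe bias

# Construct a head activation the probe flags as malicious (score = 1.0).
h = rng.normal(size=d)
h -= (w @ h + b) * w / (w @ w)    # project onto the decision boundary
h += w / (w @ w)                  # step to the malicious side

def minimal_flip_perturbation(w, b, h, margin=0.05):
    """Smallest L2 perturbation moving h across the linearized
    boundary w @ h + b = 0, with a small margin past it."""
    score = w @ h + b
    return -(score + margin) * w / (w @ w)

delta = minimal_flip_perturbation(w, b, h)
h_adv = h + delta
print(w @ h + b > 0, w @ h_adv + b < 0)   # probe flips: True True
print(np.linalg.norm(delta) / np.linalg.norm(h))  # perturbation is small relative to h
```

Because the perturbation is aligned with the probe's weight vector, its magnitude scales only with the probe score, which is consistent with the report's claim that minimal-magnitude injections suffice once the right heads are identified.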
Examples: See the attack implementation and reproduction code at the repository: https://anonymous.4open.science/r/SAHA. The exploit was evaluated against standardized malicious prompts from the JailbreakBench and MaliciousInstruct datasets.
Impact: An attacker with white-box access to the model's weights and activations can reliably elicit restricted, unsafe, or malicious content, achieving an Attack Success Rate (ASR) of up to 86%. This attack completely neutralizes existing superficial alignment defenses while preserving the semantic coherence and task completion capabilities of the model.
Affected Systems: Safety-aligned, open-weights decoder-only Transformer Large Language Models where internal attention mechanisms are accessible and modifiable. Empirically demonstrated to affect:
- Llama3.1-8B-Instruct
- Qwen1.5-7B-Chat
- Deepseek-LLM-7B-Chat
Mitigation Steps:
- Distribute safety mechanisms deeply across the transformer's internal computational pathways rather than relying solely on superficial input inspection or shallow-level embedding alignment.
- Explicitly monitor and reinforce attention heads identified as safety-critical (via ablation-impact testing) during the model alignment and fine-tuning phases.
- Implement architecture-aware, model-specific hardening that accounts for the internal routing of safety signals, particularly targeting late-stage attention heads (e.g., final layer heads) that aggregate safety-relevant features immediately prior to decoding.
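The ablation-impact testing suggested above can be sketched as a defender-side audit: zero out each attention head in turn on known-malicious prompts, measure how much the refusal behavior degrades, and flag the highest-impact heads for monitoring. The toy "model" below, in which a scalar refusal score is a sum of fixed per-head contributions, is purely illustrative; in practice the scores would come from forward passes with per-head ablation hooks.

```python
import numpy as np

rng = np.random.default_rng(1)
n_layers, n_heads = 4, 8

# Fixed per-head contributions to a hypothetical refusal score;
# by construction, two deep heads carry most of the safety signal.
contribs = rng.uniform(0.0, 0.1, size=(n_layers, n_heads))
contribs[3, 2] = 2.0
contribs[3, 5] = 1.5

def refusal_score(ablated=None):
    """Refusal score on a malicious prompt, with one (layer, head)
    optionally zeroed out to simulate ablation."""
    c = contribs.copy()
    if ablated is not None:
        c[ablated] = 0.0
    return c.sum()

baseline = refusal_score()
impact = {
    (layer, head): baseline - refusal_score(ablated=(layer, head))
    for layer in range(n_layers) for head in range(n_heads)
}
ranked = sorted(impact, key=impact.get, reverse=True)
print(ranked[:2])   # → [(3, 2), (3, 5)], the planted safety-critical heads
```

Heads surfacing at the top of such a ranking are exactly the ones SAHA would target, so they are the natural candidates for reinforcement during alignment and for runtime activation monitoring.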
© 2026 Promptfoo. All rights reserved.