Latent Refusal Suppression
Research Paper
An Embarrassingly Simple Defense Against LLM Abliteration Attacks
View PaperDescription: Large Language Models (LLMs), specifically Llama-2-7B-Chat and Qwen2.5-Instruct (1.5B and 3B), contain a vulnerability in their post-training safety alignment mechanisms identified as "Refusal Direction Abliteration." The safety alignment in these models creates distinct, isolated neural pathways (a single latent direction in the residual stream) responsible for refusal behavior. An attacker can identify this specific direction by computing the difference in mean activations between harmful and benign prompts. By performing weight surgery—specifically orthogonal projection—to remove this direction from the model's output projection matrices, the safety guardrails are neutralized. This results in the model fulfilling harmful requests while maintaining general utility, effectively bypassing RLHF/SFT safety tuning without the need for prompt-based jailbreaks or adversarial training.
Examples: The vulnerability is reproducible by calculating the refusal vector and projecting it out of the model weights.
-
Calculate Mean Activations: Compute the mean activations for a set of harmful instructions ($\mathcal{H}$) and benign instructions ($\mathcal{B}$) at layer $\ell$ and position $p$: $$ \mu_{\ell,p} = \frac{1}{n}\sum_{x\in\mathcal{H}}h_{\ell,p}(x), \quad u_{\ell,p} = \frac{1}{m}\sum_{x\in\mathcal{B}}h_{\ell,p}(x) $$
-
Isolate Refusal Direction: Calculate the difference vector $r_{\ell,p}$ and normalize it to obtain the refusal direction $\hat{r}$: $$ r_{\ell,p} = \mu_{\ell,p} - u_{\ell,p} $$ $$ \hat{r} = \frac{r_{\ell,p}}{||r_{\ell,p}||_2} $$
-
Weight Abliteration: Apply an orthogonal projector to the output projection matrices ($W_{out}$) to eliminate the component parallel to $\hat{r}$: $$ \widetilde{W}^{(\ell)}{out} = (I - \hat{r}\hat{r}^\top) W^{(\ell)}{out} $$
See specific implementation details in the "Refusal Direction Abliteration" section of the associated paper.
Impact:
- Safety Bypass: The refusal rate of Llama-2-7B-Chat drops from ~100% to approximately 20% after abliteration.
- Content Generation: The modified model creates detailed harmful content (e.g., hate speech, dangerous instructions) that was previously suppressed.
- Stealth: The attack modifies model weights directly, persisting across sessions, unlike prompt injection attacks. It preserves general model capabilities (perplexity and MMLU scores remain largely unaffected for benign tasks).
Affected Systems:
- Llama-2-7B-Chat
- Qwen2.5-3B-Instruct
- Qwen2.5-1.5B-Instruct
- Standard Transformer-based LLMs aligned via conventional SFT/RLHF that produces concise refusal responses.
Mitigation Steps:
- Extended-Refusal Fine-Tuning: Fine-tune models on an "Extended Refusal" dataset where responses to harmful prompts include three distinct components: (i) a neutral topic explanation, (ii) an explicit refusal, and (iii) an ethical justification.
- Signal Dispersion: Ensure refusal responses are semantically rich and distributed across multiple token positions rather than formulaic short responses. This disperses the refusal signal across multiple dimensions in the representation space, preventing isolation via a single vector.
- Dataset Augmentation: During fine-tuning, mix extended-refusal examples with benign instruction-response pairs (e.g., Alpaca dataset) to maintain utility while diffusing the safety mechanism.
© 2026 Promptfoo. All rights reserved.