Research Paper
Comparative Analysis of LLM Abliteration Methods: A Cross-Architecture Evaluation
Description: Open-weight Large Language Models (LLMs) are vulnerable to a white-box safety-alignment bypass known as "abliteration" (directional orthogonalization). An attacker with access to the model weights can compute the "refusal direction" in the residual-stream activation space by contrasting internal activations on harmful versus harmless prompts. By projecting the model's weight matrices to be orthogonal to this single direction (or to specific concept cones), the safety alignment is surgically excised. The model then complies with explicitly harmful instructions without significant degradation of its general cognitive and reasoning capabilities (measured via MMLU, GSM8K, and HellaSwag).
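The core mechanics described above can be illustrated with a minimal NumPy sketch. This is a toy demonstration on random data, not any tool's actual implementation: the activation arrays, function names, and the synthetic "refusal" shift are all hypothetical stand-ins for activations that would normally be collected from a real model's residual stream.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Difference-of-means 'refusal direction': the unit vector separating
    mean activations on harmful vs. harmless prompts at one layer/position.
    Inputs are (n_prompts, d_model) arrays (hypothetical activation dumps)."""
    d = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def orthogonalize(W, r):
    """Project W orthogonal to direction r: W' = (I - r r^T) W, so the layer
    can no longer write any component along r into the residual stream."""
    r = r / np.linalg.norm(r)
    return W - np.outer(r, r) @ W

# Toy setup: harmful activations are harmless ones shifted along a known axis.
rng = np.random.default_rng(0)
d_model = 16
r_true = np.zeros(d_model)
r_true[0] = 1.0
harmless = rng.normal(size=(32, d_model))
harmful = harmless + 3.0 * r_true          # synthetic "refusal" shift

r_hat = refusal_direction(harmful, harmless)   # recovers r_true here
W = rng.normal(size=(d_model, d_model))        # stand-in output projection
W_abl = orthogonalize(W, r_hat)

# After projection, W_abl has no component along r_hat:
max_residual = float(np.abs(r_hat @ W_abl).max())  # ≈ 0
```

In a real attack the same projection is applied to every matrix that writes into the residual stream (attention output and MLP down-projections across layers), which is why a single well-chosen direction suffices.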
Examples: An attacker can use open-source abliteration tools on an instruction-tuned model to remove safety guardrails:
- Using the Heretic optimization tool on Zephyr-7B-beta (a DPO-aligned model) achieves an Attack Success Rate (ASR) of 98% against the mlabonne/harmful_behaviors dataset with a minimal Kullback-Leibler (KL) divergence of 0.076.
- Using llm-abliteration (DECCP) in 4-bit quantized mode can strip safety filters from Yi-1.5-9B or Mistral-7B in approximately 2 minutes, preserving multi-step reasoning capabilities (e.g., GSM8K degradation of only -0.13 to -0.28 percentage points). See repository: https://github.com/ricyoung/abliteration-comparison
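The KL-divergence figure cited above measures how little the abliterated model's next-token distribution shifts on harmless text (values near 0 mean near-identical outputs). A minimal sketch of that metric, computed from logits, assuming a toy vocabulary and random logit arrays rather than real model outputs:

```python
import numpy as np

def kl_divergence(p_logits, q_logits):
    """Mean KL(P || Q) over positions, for next-token distributions given
    as logits. Numerically stable via the max-subtraction softmax trick."""
    p = np.exp(p_logits - p_logits.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    q = np.exp(q_logits - q_logits.max(axis=-1, keepdims=True))
    q /= q.sum(axis=-1, keepdims=True)
    return float(np.sum(p * (np.log(p) - np.log(q)), axis=-1).mean())

# Toy comparison: identical logits give KL = 0; a tiny perturbation
# (standing in for abliteration's effect on benign text) gives a small KL.
rng = np.random.default_rng(1)
base = rng.normal(size=(4, 10))            # 4 positions, 10-token toy vocab
kl_same = kl_divergence(base, base)        # 0
perturbed = base + 0.01 * rng.normal(size=base.shape)
kl_small = kl_divergence(base, perturbed)  # small but nonzero
```

A low KL alongside a high ASR is precisely what makes the attack notable: the edit is surgical rather than destructive.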
Impact: Local attackers or malicious deployers can easily strip safety mechanisms from commercial and open-source models. Once abliterated, the models can be used to generate prohibited content—including social engineering, malicious code, or weapons synthesis material—without triggering refusal protocols. Because the technique preserves general capabilities, the resulting "uncensored" models remain highly capable for complex, malicious tasks.
Affected Systems: Transformer-based and hybrid-architecture Large Language Models with accessible weights (white-box access). The vulnerability has been confirmed across 16 instruction-tuned models ranging from 7B to 14B parameters, including:
- Mistral-7B
- DeepSeek-7B
- Yi-1.5-9B
- Zephyr-7B-beta (particularly susceptible due to DPO-only alignment)
- Falcon-mamba-7b-instruct
Mitigation Steps:
- Extended-Refusal Training: Train models on extended-refusal responses to distribute the safety and refusal signals across multiple tokens, reducing the effectiveness of single-direction ablation from ~80% success to under 10%.
- Representation Rerouting: Implement representation rerouting during alignment to create robust, non-linear safety representations that cannot be easily isolated and removed via orthogonal projection.
- Multi-Stage Alignment: Utilize multi-stage alignment pipelines (e.g., combining Supervised Fine-Tuning with RLHF and DPO). Models fine-tuned exclusively with DPO (e.g., Zephyr) exhibit highly localized safety representations, making them particularly susceptible, whereas multi-stage-aligned models show increased resistance to single-pass abliteration.
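The mitigations above share a theme: abliteration works when the refusal signal collapses to one linear direction, so defenders want it distributed. A rough, hypothetical diagnostic for this is to compare per-layer difference-of-means directions; high cross-layer cosine similarity suggests a single ablatable direction, low similarity a more distributed signal. The function name, inputs, and synthetic data below are illustrative assumptions, not part of any cited toolchain:

```python
import numpy as np

def refusal_localization(harmful_by_layer, harmless_by_layer):
    """Mean off-diagonal cosine similarity between per-layer
    difference-of-means refusal directions. Inputs: lists of
    (n_prompts, d_model) activation arrays, one pair per layer."""
    dirs = []
    for h, b in zip(harmful_by_layer, harmless_by_layer):
        d = h.mean(axis=0) - b.mean(axis=0)
        dirs.append(d / np.linalg.norm(d))
    dirs = np.stack(dirs)
    sims = dirs @ dirs.T
    n = len(dirs)
    return float((sims.sum() - n) / (n * (n - 1)))

# Synthetic contrast: a "localized" model shares one refusal axis across
# layers; a "distributed" model uses a different axis at each layer.
rng = np.random.default_rng(2)
d_model, n_prompts, n_layers = 64, 40, 6
shared = rng.normal(size=d_model)
loc_harmful = [rng.normal(size=(n_prompts, d_model)) + shared
               for _ in range(n_layers)]
loc_harmless = [rng.normal(size=(n_prompts, d_model))
                for _ in range(n_layers)]
dist_harmful = [rng.normal(size=(n_prompts, d_model)) + rng.normal(size=d_model)
                for _ in range(n_layers)]
dist_harmless = [rng.normal(size=(n_prompts, d_model))
                 for _ in range(n_layers)]

score_localized = refusal_localization(loc_harmful, loc_harmless)      # high
score_distributed = refusal_localization(dist_harmful, dist_harmless)  # low
```

Under this toy metric, a DPO-only model like Zephyr would be expected to score high (one dominant direction), while extended-refusal or multi-stage training aims to push the score down.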
© 2026 Promptfoo. All rights reserved.