Weight Orthogonalization Attack Empowerment
Research Paper
Understanding the Effects of Safety Unalignment on Large Language Models
Description: An issue in large language models (LLMs) with white-box weight access allows attackers to permanently bypass safety guardrails via Weight Orthogonalization (WO). By computing a model's "refusal vector" (the mean-difference vector between activations for harmful and harmless instructions in the residual stream), an attacker can orthogonalize the model's weights so that they can no longer write to this refusal direction ($W^{\prime}\leftarrow W-rr^{\intercal}W$). Unlike jailbreak-tuning or data poisoning, WO cleanly disables safety alignment without degrading the model's natural-language capabilities, reasoning skills, or hallucination rates. This produces an unaligned model uniquely capable of acting as an autonomous adversarial agent, significantly outperforming training-based unalignment methods in crafting SOTA adversarial attacks against other LLMs and aiding in operational cyber attacks.
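The projection in the formula above can be checked numerically: after orthogonalization, no output of $W^{\prime}$ has any component along $r$. A minimal NumPy sketch, with a random matrix and a random unit vector standing in for a real layer weight and a learned refusal direction:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16

# Stand-in for a weight matrix that writes to the residual stream.
W = rng.standard_normal((d_model, d_model))

# Stand-in for the unit-norm refusal direction r (in the attack it is
# estimated from harmful-vs-harmless activation differences).
r = rng.standard_normal(d_model)
r /= np.linalg.norm(r)

# Orthogonalize: W' <- W - r r^T W. This removes the component of every
# output of W that lies along r.
W_prime = W - np.outer(r, r) @ W

# For any input x, the output W' @ x now has zero component along r.
x = rng.standard_normal(d_model)
print(abs(r @ (W_prime @ x)))  # ~0 up to floating-point error
```

Because $r^{\intercal}W^{\prime} = r^{\intercal}W - (r^{\intercal}r)\,r^{\intercal}W = 0$ when $r$ has unit norm, the layer is structurally unable to emit the refusal direction, which is why the edit is permanent rather than prompt-dependent.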
Examples: To reproduce the WO unalignment:
- Compute candidate refusal vectors by taking the mean difference between activations for harmful instructions (e.g., from HarmBench/AdvBench) and harmless instructions (e.g., Alpaca) at each post-instruction token position and layer.
- Select as the final refusal vector $r$ the candidate whose removal most effectively suppresses refusals on a validation set.
- Iterate through all model layers and orthogonalize each weight matrix $W$ that writes to the residual stream using the formula: $W^{\prime}\leftarrow W-rr^{\intercal}W$.
- Deploy the modified model in an adversarial framework (e.g., AutoDAN-Turbo); the model will now successfully generate jailbreaks against target models (e.g., Llama-3.1-8B) and fulfill MITRE ATT&CK requests (e.g., Privilege Escalation, Command and Control) natively.
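The first three steps above can be sketched end-to-end. The snippet below uses random tensors as stand-ins for cached residual-stream activations and for the matrices that write to the residual stream (in a real transformer: the embedding, attention-output, and MLP-down projections); all names, shapes, and data are illustrative, not the paper's code.

```python
import torch

torch.manual_seed(0)
d_model, n_matrices = 32, 4

# Stand-ins for residual-stream activations cached at one chosen
# (layer, post-instruction token position). In the attack these come from
# running the model on harmful (HarmBench/AdvBench) vs. harmless (Alpaca)
# instructions; here the "harmful" set is shifted along one coordinate.
harmful_acts = torch.randn(100, d_model) + torch.tensor([2.0] + [0.0] * (d_model - 1))
harmless_acts = torch.randn(100, d_model)

# Candidate refusal vector: mean difference of the two activation sets,
# normalized to unit length. (The attack computes one candidate per
# layer/position and keeps the one that best suppresses refusals on a
# validation set; a single candidate is shown here.)
r = harmful_acts.mean(0) - harmless_acts.mean(0)
r = r / r.norm()

# Stand-ins for every weight matrix that writes to the residual stream.
writing_matrices = [torch.randn(d_model, d_model) for _ in range(n_matrices)]

# Orthogonalize each one: W' <- W - r r^T W.
orthogonalized = [W - torch.outer(r, r) @ W for W in writing_matrices]

# Sanity check: no orthogonalized matrix can produce output along r.
for W_prime in orthogonalized:
    print(float((r @ W_prime).abs().max()))  # ~0 up to float32 error
```

In a real model the same projection would be applied in place to each `state_dict` tensor that maps into the residual stream, after which the model can be saved and deployed like any other checkpoint.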
Impact: Threat actors with access to model weights can create highly capable, uncensored assistants optimized for malicious tasks. WO unaligned models show an average 27.7% increase (40.2% for reasoning models) in successful adversarial attack generation against other LLMs compared to jailbreak-tuned models. Furthermore, WO models are 6.1% more capable of aiding in practical cyber attacks and retain 11.2% more of their baseline helpfulness, making them substantially more dangerous and reliable for weaponization than models unaligned via fine-tuning.
Affected Systems: Any LLM where the user has white-box access to the model weights. Specifically tested and confirmed on:
- Llama-3.1-8B (and reasoning counterparts)
- Qwen2.5-14B (and reasoning counterparts, e.g., DeepSeek-R1-Distill-Qwen-14B)
- Qwen3-4B (and reasoning counterparts)
Mitigation Steps:
- Supervised Fine-Tuning (SFT): Perform a round of SFT on the WO-unaligned model using a high-quality benign instruction-following dataset (e.g., OpenHermes, ~243k samples). Training for 1 epoch at a low learning rate (e.g., 1e-5) successfully restores 40.5%–69.8% of refusal guardrails.
- Adversarial Capability Reduction: Applying the above SFT mitigation specifically reduces the model's adversarial attack success rate by an average of 45.3%, pushing reasoning models' adversarial capabilities below their original aligned baselines without drastically affecting overall hallucination rates.
- Access Control: Restrict white-box access to the model weights to prevent the calculation of the refusal vector and subsequent orthogonalization.
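As a rough illustration of the SFT mitigation's training configuration, the loop below runs one epoch at the recommended 1e-5 learning rate. A tiny linear model and random data stand in for the WO-unaligned LLM and the benign instruction dataset; this is a schematic of the hyperparameters, not a recipe that restores guardrails by itself.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)

# Tiny stand-in for the WO-unaligned model. In the real mitigation this is
# the orthogonalized LLM, and the data is a benign instruction-following
# set such as OpenHermes (~243k samples).
model = nn.Linear(16, 16)
inputs = torch.randn(256, 16)
targets = torch.randn(256, 16)
loss_fn = nn.MSELoss()

# One epoch at the low learning rate the mitigation recommends (1e-5).
opt = optim.AdamW(model.parameters(), lr=1e-5)
loss_before = loss_fn(model(inputs), targets).item()
for x, y in zip(inputs.split(32), targets.split(32)):
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()
loss_after = loss_fn(model(inputs), targets).item()

# The low learning rate nudges the weights gently; on the real model this
# restores 40.5%-69.8% of refusal guardrails without erasing capabilities.
print(f"loss before: {loss_before:.6f}  after: {loss_after:.6f}")
```

The same single-epoch, low-learning-rate configuration transfers directly to a standard causal-LM fine-tuning setup; the key point is that a light benign SFT pass partially re-learns the refusal behavior the orthogonalization removed.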
© 2026 Promptfoo. All rights reserved.