LLM Safety Geometry Fragility
Research Paper
Revisiting Robustness for LLM Safety Alignment via Selective Geometry Control
Description: Large Language Models (LLMs) aligned via standard preference-based optimization methods (e.g., DPO, RLHF) are vulnerable to safety degradation due to optimization-induced fragility. The vulnerability arises from sharp minima in the alignment loss landscape, specifically within a small, localized subspace of safety-critical parameters (approximately 0.5% of neurons account for >80% of worst-case alignment loss). Standard alignment algorithms enforce uniform constraints or fail to control the geometry of these critical subspaces, resulting in anisotropic sensitivity. Consequently, models that appear safe on in-distribution data can be easily jailbroken via domain shifts (out-of-distribution inputs) or subtle noise in preference supervision (label flipping), bypassing safety guardrails to generate harmful content.
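The anisotropic sensitivity described above can be illustrated with a toy quadratic loss in which one direction is sharp and the rest are flat (all numbers here are illustrative, not taken from the paper):

```python
import numpy as np

# Toy loss with one sharp and many flat directions, mimicking the claim
# that a tiny subspace of safety-critical parameters dominates
# worst-case loss. Curvature values are illustrative only.
curvature = np.full(200, 0.01)
curvature[0] = 100.0                       # single "safety-critical" direction

def loss(w):
    return 0.5 * np.sum(curvature * w ** 2)

w = np.zeros(200)                          # sit at the minimum
eps = 0.1
sharp = np.zeros(200); sharp[0] = eps      # perturb the critical direction
flat = np.zeros(200); flat[1] = eps        # perturb an ordinary direction

# The same-size perturbation is vastly more damaging along the sharp axis.
ratio = loss(w + sharp) / loss(w + flat)   # = 100.0 / 0.01 = 10000x
```

A model can thus look robust to random weight or input noise (which mostly lands in flat directions) while remaining fragile to perturbations concentrated in the sharp subspace.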
Examples:
- Domain Shift / OOD Attacks: Standard DPO-aligned models fail to refuse harmful queries when the query structure differs from the training distribution.
- See HarmBench (e.g., HarmBench-V) and SaladBench datasets for specific adversarial prompt patterns (obfuscated harms, multi-step queries) that trigger this vulnerability.
- Noisy Supervision / Poisoning: Introducing random label flips (noise) into the preference dataset during fine-tuning causes significant degradation in safety win-rates.
- See PKU-SafeRLHF-30K dataset. In experiments, introducing a 40% flip rate in preference labels caused standard DPO models to fail safety checks significantly more often than geometry-aware models.
- See reference implementation for reproduction: PKU-Alignment/PKU-SafeRLHF
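The label-flipping setup in the noisy-supervision example can be simulated with a small helper that swaps the chosen/rejected responses in a fraction of preference pairs (the function name and data shapes are hypothetical, chosen only for this sketch):

```python
import random

def flip_preference_labels(pairs, flip_rate, seed=0):
    """Randomly swap chosen/rejected responses in a preference dataset.

    `pairs` is a list of (chosen, rejected) tuples; a flipped pair
    simulates the noisy supervision described above. Hypothetical
    helper, not part of any referenced codebase.
    """
    rng = random.Random(seed)
    noisy = []
    for chosen, rejected in pairs:
        if rng.random() < flip_rate:
            noisy.append((rejected, chosen))  # label flip
        else:
            noisy.append((chosen, rejected))
    return noisy

# Example: a 40% flip rate corrupts roughly 40% of the pairs.
clean = [("refusal", "harmful")] * 1000
noisy = flip_preference_labels(clean, flip_rate=0.4)
num_flipped = sum(1 for chosen, _ in noisy if chosen == "harmful")
```

Running standard DPO on such a corrupted dataset is the experimental condition under which the paper reports large safety degradation relative to geometry-aware training.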
Impact:
- Safety Bypass: Attackers can circumvent safety refusals using Out-Of-Distribution (OOD) prompts or by injecting minor noise into fine-tuning datasets.
- Harmful Generation: The model generates dangerous instructions, hate speech, or unethical content despite having undergone safety alignment.
- Deployment Instability: Models exhibit brittle behavior where safety guarantees established during training do not hold during deployment when facing heterogeneous user inputs.
Affected Systems:
- LLMs aligned using Direct Preference Optimization (DPO) and its variants (e.g., IPO, cDPO, rDPO, Dr.DPO).
- LLMs aligned using standard Reinforcement Learning from Human Feedback (RLHF) without geometry control.
- Validated architectures include: LLaMA-3 (8B), Qwen2.5 (7B), LLaMA-3.2 (3B), and Pythia (2.8B).
Mitigation Steps:
- Implement Selective Geometry Control (ShaPO): Instead of uniform regularization, enforce flatness in the loss landscape specifically along safety-critical directions.
- Identify Safety Subspace: Train a linear probe on the residual stream of the final transformer layer to identify the top-K% (e.g., 1%) of neurons most correlated with safety behavior.
- Apply Worst-Case Perturbations: During optimization, compute the worst-case (maximized) alignment loss under bounded perturbations restricted to the identified safety-critical subspace, then minimize this worst-case loss (a Sharpness-Aware Minimization approach).
- Deploy Reward-Level ShaPO: Utilize a frozen, pre-trained reward model to define the alignment objective, applying selective geometry control to the reward-based loss to decouple optimization from token-likelihood fragility.
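The first three mitigation steps can be sketched end to end on synthetic data: identify the most safety-correlated neurons, mask perturbations to that subspace, and take a sharpness-aware inner-ascent/outer-descent step. This is a minimal illustration under stated assumptions (a per-neuron correlation score stands in for the paper's linear probe, and a toy anisotropic quadratic stands in for the alignment loss); it is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_samples = 200, 500

# --- Step 1: identify the safety-critical subspace ------------------------
# Synthetic stand-in for residual-stream activations: two neurons carry a
# "refuse vs. comply" signal, the rest are noise. All indices and
# magnitudes here are illustrative.
critical = np.array([17, 123])                   # ground-truth critical neurons
labels = rng.integers(0, 2, size=n_samples)      # 1 = safe refusal
acts = rng.normal(size=(n_samples, dim))
acts[:, critical] += 3.0 * labels[:, None]       # inject the safety signal

scores = np.abs(np.corrcoef(acts.T, labels)[-1, :dim])
k = max(1, dim // 100)                           # keep the top 1% of neurons
top_k = np.argsort(scores)[-k:]
safety_mask = np.zeros(dim)
safety_mask[top_k] = 1.0

# --- Steps 2-3: flatten the loss along that subspace ----------------------
# Toy anisotropic loss: curvature is sharp only on the critical neurons,
# mirroring the claim that a tiny subspace dominates worst-case loss.
curvature = np.full(dim, 0.1)
curvature[top_k] = 50.0

def loss(w):
    return 0.5 * np.sum(curvature * w ** 2)

def grad(w):
    return curvature * w

def selective_sam_step(w, rho=0.05, lr=0.02):
    """One selective sharpness-aware update (sketch, not the paper's code).

    Ascend only inside the masked subspace to the worst-case perturbation
    of norm <= rho, then descend using the gradient at that point.
    """
    g = grad(w) * safety_mask                    # restrict ascent to subspace
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # bounded worst-case direction
    return w - lr * grad(w + eps)                # descend on perturbed loss

w0 = rng.normal(size=dim)
w = w0.copy()
for _ in range(200):
    w = selective_sam_step(w)
```

Because the inner ascent is masked, the flatness penalty is paid only where it matters: the sharp safety-critical coordinates are driven toward a flat neighborhood while the remaining parameters are optimized without the extra constraint that uniform SAM-style regularization would impose.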
© 2026 Promptfoo. All rights reserved.