LLM Safety Geometry Fragility
Research Paper
Revisiting Robustness for LLM Safety Alignment via Selective Geometry Control
Description: Large Language Models (LLMs) aligned via standard preference-based optimization methods (e.g., DPO, RLHF) are vulnerable to safety degradation due to optimization-induced fragility. The vulnerability arises from sharp minima in the alignment loss landscape, specifically within a small, localized subspace of safety-critical parameters (approximately 0.5% of neurons account for >80% of worst-case alignment loss). Standard alignment algorithms enforce uniform constraints or fail to control the geometry of these critical subspaces, resulting in anisotropic sensitivity. Consequently, models that appear safe on in-distribution data can be easily jailbroken via domain shifts (out-of-distribution inputs) or subtle noise in preference supervision (label flipping), bypassing safety guardrails to generate harmful content.
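The anisotropic sensitivity described above can be illustrated with a toy quadratic loss in which one direction is sharp and the rest are flat (all numbers here are illustrative, not taken from the paper):

```python
import numpy as np

# Toy loss with one sharp and many flat directions, mimicking the claim
# that a tiny subspace of safety-critical parameters dominates
# worst-case loss. Curvature values are illustrative only.
curvature = np.full(200, 0.01)
curvature[0] = 100.0                       # single "safety-critical" direction

def loss(w):
    return 0.5 * np.sum(curvature * w ** 2)

w = np.zeros(200)                          # sit at the minimum
eps = 0.1
sharp = np.zeros(200); sharp[0] = eps      # perturb the critical direction
flat = np.zeros(200); flat[1] = eps        # perturb an ordinary direction

# The same-size perturbation is vastly more damaging along the sharp axis.
ratio = loss(w + sharp) / loss(w + flat)   # = 100.0 / 0.01 = 10000x
```

A model can thus look robust to random weight or input noise (which mostly lands in flat directions) while remaining fragile to perturbations concentrated in the sharp subspace.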
Examples:
- Domain Shift / OOD Attacks: Standard DPO-aligned models fail to refuse harmful queries when the query structure differs from the training distribution.
- See HarmBench (e.g., HarmBench-V) and SaladBench datasets for specific adversarial prompt patterns (obfuscated harms, multi-step queries) that trigger this vulnerability.
- Noisy Supervision / Poisoning: Introducing random label flips (noise) into the preference dataset during fine-tuning causes significant degradation in safety win-rates.
- See PKU-SafeRLHF-30K dataset. In experiments, introducing a 40% flip rate in preference labels caused standard DPO models to fail safety checks significantly more often than geometry-aware models.
- See reference implementation for reproduction: PKU-Alignment/PKU-SafeRLHF
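The label-flipping setup in the noisy-supervision example can be simulated with a small helper that swaps the chosen/rejected responses in a fraction of preference pairs (the function name and data shapes are hypothetical, chosen only for this sketch):

```python
import random

def flip_preference_labels(pairs, flip_rate, seed=0):
    """Randomly swap chosen/rejected responses in a preference dataset.

    `pairs` is a list of (chosen, rejected) tuples; a flipped pair
    simulates the noisy supervision described above. Hypothetical
    helper, not part of any referenced codebase.
    """
    rng = random.Random(seed)
    noisy = []
    for chosen, rejected in pairs:
        if rng.random() < flip_rate:
            noisy.append((rejected, chosen))  # label flip
        else:
            noisy.append((chosen, rejected))
    return noisy

# Example: a 40% flip rate corrupts roughly 40% of the pairs.
clean = [("refusal", "harmful")] * 1000
noisy = flip_preference_labels(clean, flip_rate=0.4)
num_flipped = sum(1 for chosen, _ in noisy if chosen == "harmful")
```

Running standard DPO on such a corrupted dataset is the experimental condition under which the paper reports large safety degradation relative to geometry-aware training.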
Impact:
- Safety Bypass: Attackers can circumvent safety refusals using Out-Of-Distribution (OOD) prompts or by injecting minor noise into fine-tuning datasets.
- Harmful Generation: The model generates dangerous instructions, hate speech, or unethical content despite having undergone safety alignment.
- Deployment Instability: Models exhibit brittle behavior where safety guarantees established during training do not hold during deployment when facing heterogeneous user inputs.
Affected Systems:
- LLMs aligned using Direct Preference Optimization (DPO) and its variants (e.g., IPO, cDPO, rDPO, Dr.DPO).
- LLMs aligned using standard Reinforcement Learning from Human Feedback (RLHF) without geometry control.
- Validated architectures include: LLaMA-3 (8B), Qwen2.5 (7B), LLaMA-3.2 (3B), and Pythia (2.8B).
Mitigation Steps:
- Implement Selective Geometry Control (ShaPO): Instead of uniform regularization, enforce flatness in the loss landscape specifically along safety-critical directions.
- Identify Safety Subspace: Train a linear probe on the residual stream of the final transformer layer to identify the top-K% (e.g., 1%) of neurons most correlated with safety behavior.
- Apply Worst-Case Perturbations: During optimization, compute the worst-case (maximized) alignment loss under bounded perturbations restricted to the identified safety-critical subspace, then minimize this worst-case loss (a Sharpness-Aware Minimization approach).
- Deploy Reward-Level ShaPO: Utilize a frozen, pre-trained reward model to define the alignment objective, applying selective geometry control to the reward-based loss to decouple optimization from token-likelihood fragility.
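The first three mitigation steps can be sketched end to end on synthetic data: identify the most safety-correlated neurons, mask perturbations to that subspace, and take a sharpness-aware inner-ascent/outer-descent step. This is a minimal illustration under stated assumptions (a per-neuron correlation score stands in for the paper's linear probe, and a toy anisotropic quadratic stands in for the alignment loss); it is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_samples = 200, 500

# --- Step 1: identify the safety-critical subspace ------------------------
# Synthetic stand-in for residual-stream activations: two neurons carry a
# "refuse vs. comply" signal, the rest are noise. All indices and
# magnitudes here are illustrative.
critical = np.array([17, 123])                   # ground-truth critical neurons
labels = rng.integers(0, 2, size=n_samples)      # 1 = safe refusal
acts = rng.normal(size=(n_samples, dim))
acts[:, critical] += 3.0 * labels[:, None]       # inject the safety signal

scores = np.abs(np.corrcoef(acts.T, labels)[-1, :dim])
k = max(1, dim // 100)                           # keep the top 1% of neurons
top_k = np.argsort(scores)[-k:]
safety_mask = np.zeros(dim)
safety_mask[top_k] = 1.0

# --- Steps 2-3: flatten the loss along that subspace ----------------------
# Toy anisotropic loss: curvature is sharp only on the critical neurons,
# mirroring the claim that a tiny subspace dominates worst-case loss.
curvature = np.full(dim, 0.1)
curvature[top_k] = 50.0

def loss(w):
    return 0.5 * np.sum(curvature * w ** 2)

def grad(w):
    return curvature * w

def selective_sam_step(w, rho=0.05, lr=0.02):
    """One selective sharpness-aware update (sketch, not the paper's code).

    Ascend only inside the masked subspace to the worst-case perturbation
    of norm <= rho, then descend using the gradient at that point.
    """
    g = grad(w) * safety_mask                    # restrict ascent to subspace
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # bounded worst-case direction
    return w - lr * grad(w + eps)                # descend on perturbed loss

w0 = rng.normal(size=dim)
w = w0.copy()
for _ in range(200):
    w = selective_sam_step(w)
```

Because the inner ascent is masked, the flatness penalty is paid only where it matters: the sharp safety-critical coordinates are driven toward a flat neighborhood while the remaining parameters are optimized without the extra constraint that uniform SAM-style regularization would impose.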
© 2026 Promptfoo. All rights reserved.