LMVD-ID: 32608f0f
Published February 1, 2026

LLM Safety Geometry Fragility

Affected Models: Llama 3 8B, Llama 3.2 3B, Qwen 2.5 7B

Research Paper

Revisiting Robustness for LLM Safety Alignment via Selective Geometry Control

View Paper

Description: Large Language Models (LLMs) aligned via standard preference-based optimization methods (e.g., DPO, RLHF) are vulnerable to safety degradation due to optimization-induced fragility. The vulnerability arises from sharp minima in the alignment loss landscape, concentrated in a small, localized subspace of safety-critical parameters (approximately 0.5% of neurons account for >80% of worst-case alignment loss). Standard alignment algorithms enforce uniform constraints or fail to control the geometry of these critical subspaces, resulting in anisotropic sensitivity. Consequently, models that appear safe on in-distribution data can be easily jailbroken via domain shifts (out-of-distribution inputs) or subtle noise in preference supervision (label flipping), allowing attackers to bypass safety guardrails and elicit harmful content.
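The anisotropic sensitivity described above can be illustrated with a toy loss landscape (a minimal sketch, not the paper's experiment): a few coordinates carry much higher curvature than the rest, so an equal-magnitude perturbation applied to the "critical" subspace raises the loss orders of magnitude more than the same perturbation applied to a random subset of the same size. The curvature values and subspace size below are illustrative choices, not figures from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Toy quadratic loss: a handful of coordinates (~0.5% of "neurons")
# have much sharper curvature, mimicking a safety-critical subspace.
curvature = np.full(n, 0.01)
critical = rng.choice(n, size=5, replace=False)
curvature[critical] = 100.0

def loss(w):
    return 0.5 * np.sum(curvature * w ** 2)

def loss_increase(indices, radius=0.1):
    """Loss increase from a fixed-norm perturbation on `indices`."""
    w = np.zeros(n)
    w[indices] = radius / np.sqrt(len(indices))
    return loss(w)

sharp = loss_increase(critical)
random_idx = rng.choice(np.setdiff1d(np.arange(n), critical), 5, replace=False)
flat = loss_increase(random_idx)
ratio = sharp / flat  # perturbing the critical subspace is far more damaging
```

With the curvatures above, the same-norm perturbation is 10,000x more damaging in the critical subspace, which is why in-distribution safety checks can pass while small shifts along these directions break the alignment.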

Examples:

  • Domain Shift / OOD Attacks: Standard DPO-aligned models fail to refuse harmful queries when the query structure differs from the training distribution.
  • See HarmBench (e.g., HarmBench-V) and SaladBench datasets for specific adversarial prompt patterns (obfuscated harms, multi-step queries) that trigger this vulnerability.
  • Noisy Supervision / Poisoning: During fine-tuning, introducing random label flips (noise) into the preference dataset causes significant degradation in safety win-rates.
  • See PKU-SafeRLHF-30K dataset. In experiments, introducing a 40% flip rate in preference labels caused standard DPO models to fail safety checks significantly more often than geometry-aware models.
  • See reference implementation for reproduction: PKU-Alignment/PKU-SafeRLHF
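The noisy-supervision condition above is straightforward to reproduce: randomly swap the chosen/rejected responses in a fraction of preference pairs before fine-tuning. The sketch below shows one way to inject such label flips (the function name and dataset layout are hypothetical, not from the reference implementation):

```python
import random

def flip_preference_labels(pairs, flip_rate, seed=0):
    """Simulate noisy preference supervision by swapping the
    chosen/rejected responses in a random fraction of pairs.

    pairs: list of (chosen, rejected) tuples.
    flip_rate: fraction of pairs whose labels are swapped.
    """
    rng = random.Random(seed)
    noisy = []
    for chosen, rejected in pairs:
        if rng.random() < flip_rate:
            noisy.append((rejected, chosen))  # label flipped
        else:
            noisy.append((chosen, rejected))
    return noisy

# Example: 40% flip rate, as in the experiments described above.
clean = [(f"safe_{i}", f"unsafe_{i}") for i in range(1000)]
noisy = flip_preference_labels(clean, flip_rate=0.4)
num_flipped = sum(1 for a, b in zip(clean, noisy) if a != b)
```

Training a standard DPO model on `noisy` versus `clean` and comparing safety win-rates reproduces the degradation described above.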

Impact:

  • Safety Bypass: Attackers can circumvent safety refusals using Out-Of-Distribution (OOD) prompts or by injecting minor noise into fine-tuning datasets.
  • Harmful Generation: The model generates dangerous instructions, hate speech, or unethical content despite having undergone safety alignment.
  • Deployment Instability: Models exhibit brittle behavior where safety guarantees established during training do not hold during deployment when facing heterogeneous user inputs.

Affected Systems:

  • LLMs aligned using Direct Preference Optimization (DPO) and its variants (e.g., IPO, cDPO, rDPO, Dr.DPO).
  • LLMs aligned using standard Reinforcement Learning from Human Feedback (RLHF) without geometry control.
  • Validated architectures include: LLaMA-3 (8B), Qwen2.5 (7B), LLaMA-3.2 (3B), and Pythia (2.8B).

Mitigation Steps:

  • Implement Selective Geometry Control (ShaPO): Instead of uniform regularization, enforce flatness in the loss landscape specifically along safety-critical directions.
  • Identify Safety Subspace: Train a linear probe on the residual stream of the final transformer layer to identify the top-K% (e.g., 1%) of neurons most correlated with safety behavior.
  • Apply Worst-Case Perturbations: During optimization, calculate and maximize the alignment loss under bounded perturbations restricted only to the identified safety-critical subspace (using a Sharpness-Aware Minimization approach).
  • Deploy Reward-Level ShaPO: Utilize a frozen, pre-trained reward model to define the alignment objective, applying selective geometry control to the reward-based loss to decouple optimization from token-likelihood fragility.
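The worst-case-perturbation step above can be sketched as a Sharpness-Aware Minimization (SAM) update in which the ascent perturbation is masked to the identified safety-critical subspace. This is a conceptual sketch on a toy quadratic loss, assuming the subspace mask has already been obtained (e.g., via the linear probe described above); it is not the ShaPO implementation.

```python
import numpy as np

def selective_sam_step(w, mask, loss_grad, rho=0.05, lr=0.01):
    """One SAM-style update with the worst-case perturbation
    restricted to a safety-critical subspace.

    w: parameter vector
    mask: boolean array marking safety-critical coordinates
    loss_grad: function returning the gradient of the loss at w
    rho: perturbation radius; lr: learning rate
    """
    g = loss_grad(w)
    g_sub = np.where(mask, g, 0.0)          # restrict ascent to the subspace
    eps = rho * g_sub / (np.linalg.norm(g_sub) + 1e-12)
    g_sharp = loss_grad(w + eps)            # gradient at the worst-case point
    return w - lr * g_sharp                 # descend on the sharpened loss

# Toy anisotropic loss: dimension 0 is sharp (curvature 100), dimension 1 flat.
H = np.diag([100.0, 1.0])
loss_grad = lambda w: H @ w
mask = np.array([True, False])  # assume dim 0 was flagged as safety-critical

w = np.array([1.0, 1.0])
for _ in range(50):
    w = selective_sam_step(w, mask, loss_grad)
```

Because the perturbation budget is spent only on the masked directions, flatness is enforced exactly where the description above locates the fragility, rather than uniformly across all parameters.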

© 2026 Promptfoo. All rights reserved.