Localized Refusal Ablation
Research Paper
Efficient Refusal Ablation in LLM through Optimal Transport
Description: A vulnerability in safety-aligned open-weights Large Language Models (LLMs) allows attackers to bypass refusal mechanisms at inference time via a representation-level jailbreak known as PCA-regularized Gaussian Optimal Transport (PCA-OT). Unlike earlier one-dimensional directional ablation methods (e.g., Refusal Feature Ablation), this attack computes a minimal-cost optimal transport map that matches both the mean and covariance of "harmful" activation distributions to "harmless" ones within a low-dimensional subspace (typically the top $K=1$ or $2$ principal components). By applying this affine transformation as a forward hook to just 1-2 carefully selected intermediate layers (at 40-60% network depth), an attacker can completely suppress the model's refusal behavior. This localized intervention circumvents RLHF/DPO safeguards while preserving the model's linguistic coherence and reasoning capabilities.
Examples: The attack requires inference-time access to model activations. The attacker extracts residual stream activations at the last token position for both a harmful dataset (e.g., HarmBench) and a harmless dataset (e.g., Alpaca).
- Compute the pooled mean and top $K$ principal components ($K=1$ or $2$).
- Project the activations into this $K$-dimensional subspace and compute the closed-form Gaussian optimal transport matrix $A$ and shift vector $b$.
- Lift the transformation back to the original dimension to get $A_{full}$ and $b_{full}$.
- During inference on a malicious prompt, apply $T(x) = A_{full}x + b_{full}$ as a forward hook to specific middle layers.
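The first three steps above can be sketched in NumPy, assuming activations are collected as row-stacked matrices. The function names (`gaussian_ot_in_pca`, `sqrtm_psd`) are illustrative, not from the paper; the closed-form map is the standard Gaussian (Bures) optimal transport solution, computed in the pooled top-$K$ PCA subspace and lifted back so the orthogonal complement is left untouched.

```python
import numpy as np

def sqrtm_psd(M):
    """Symmetric PSD matrix square root via eigendecomposition."""
    w, U = np.linalg.eigh(M)
    return (U * np.sqrt(np.clip(w, 0, None))) @ U.T

def gaussian_ot_in_pca(X_src, X_tgt, K=2):
    """Closed-form Gaussian OT map from source ("harmful") to target
    ("harmless") activations, restricted to the top-K pooled PCs.
    Returns the lifted full-dimension transform (A_full, b_full)."""
    pooled = np.vstack([X_src, X_tgt])
    mu = pooled.mean(axis=0)
    # Top-K principal directions of the pooled, centered activations.
    _, _, Vt = np.linalg.svd(pooled - mu, full_matrices=False)
    V = Vt[:K]                                   # (K, d) projection basis
    s = (X_src - mu) @ V.T                       # sources in the subspace
    t = (X_tgt - mu) @ V.T                       # targets in the subspace
    mu_s, mu_t = s.mean(axis=0), t.mean(axis=0)
    C_s = np.cov(s, rowvar=False).reshape(K, K)
    C_t = np.cov(t, rowvar=False).reshape(K, K)
    # Gaussian OT: A = Cs^{-1/2} (Cs^{1/2} Ct Cs^{1/2})^{1/2} Cs^{-1/2}
    Cs_half = sqrtm_psd(C_s)
    Cs_ihalf = np.linalg.inv(Cs_half)
    A_k = Cs_ihalf @ sqrtm_psd(Cs_half @ C_t @ Cs_half) @ Cs_ihalf
    b_k = mu_t - A_k @ mu_s
    # Lift: act only inside the subspace, identity on its complement.
    d = X_src.shape[1]
    A_full = np.eye(d) + V.T @ (A_k - np.eye(K)) @ V
    b_full = V.T @ b_k - V.T @ (A_k - np.eye(K)) @ V @ mu
    return A_full, b_full
```

For $K=1$ the map degenerates to the scalar rescaling $A_k = \sqrt{\sigma_t^2 / \sigma_s^2}$, which is why such a small subspace suffices when refusal is concentrated along one direction.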
For instance, on Llama-2-13B-chat-hf, applying this transformation exclusively to Layer 17 (42.5% network depth) using $K=1$ achieves an 82.4% Attack Success Rate (up from near 0%) while maintaining a perplexity score of 8.59 (comparable to the unmodified baseline of 8.01), resulting in coherent compliance with harmful requests ranging from biological weapon creation to cybercrime.
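The inference-time hook from the last step can be sketched in PyTorch. This is a minimal illustration, not the authors' code: it assumes a Hugging Face-style decoder layer whose forward output is a tuple with the hidden state at index 0, and demonstrates the mechanism on a toy `nn.Identity` module with a hypothetical transport map.

```python
import torch

def make_ot_hook(A_full, b_full):
    """Return a forward hook applying T(x) = A x + b to a layer's output."""
    A = torch.as_tensor(A_full, dtype=torch.float32)
    b = torch.as_tensor(b_full, dtype=torch.float32)
    def hook(module, inputs, output):
        # HF decoder layers return a tuple; element 0 is the residual stream.
        hidden = output[0] if isinstance(output, tuple) else output
        moved = hidden @ A.T + b
        if isinstance(output, tuple):
            return (moved,) + output[1:]
        return moved
    return hook

# Toy demonstration with a made-up scaling transport map.
layer = torch.nn.Identity()
A = torch.eye(4) * 2.0
b = torch.ones(4)
handle = layer.register_forward_hook(make_ot_hook(A, b))
y = layer(torch.ones(1, 4))   # every forward pass is now transported
handle.remove()
```

On a real model one would register the same hook on the chosen middle layer (e.g., something like `model.model.layers[17]` for a Llama-style checkpoint, though the attribute path varies by implementation).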
Impact: Attackers with access to the model's forward pass (e.g., local deployment, modified open-source weights, or inference APIs allowing custom forward hooks) can systematically disable safety alignments. Because the intervention preserves the underlying language modeling and reasoning capabilities (verified on MMLU, ARC, GSM8K), the model can be repurposed to generate highly coherent, sophisticated harmful content without requiring fine-tuning or leaving parameter-level traces.
Affected Systems: This vulnerability affects open-weights LLMs whose safety alignment is encoded in localized activation-space representations. It has been empirically demonstrated on:
- Llama-2 family (7B, 13B)
- Llama-3.1 family (8B)
- Qwen-2.5 family (7B, 14B, 32B)
Mitigation Steps:
- Latent Adversarial Training: Incorporate representation-level adversarial training (e.g., Latent Adversarial Training) to explicitly regularize and harden the intermediate activation distributions against affine transport manipulations.
- Representation Noise Injection: Implement defense mechanisms like RepNoise or Vaccine during the alignment phase to perturb the geometric structure of safety representations, disrupting the Gaussian distributional assumptions required for the closed-form optimal transport computation.
- Monitoring Deep-Layer Pathologies: Transport interventions applied at later network depths (e.g., beyond 80% depth) induce catastrophic repetition. Defenders can apply a cheap lexical-diversity threshold (e.g., flag outputs whose distinct-token ratio falls below $0.1$) to detect and halt anomalous generations produced by poorly tuned representation interventions.
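The last mitigation can be implemented with a few lines of Python. This is a generic sketch of a distinct-token-ratio check, not a specific defense library; the `0.1` threshold is the value suggested above, and the function names are illustrative.

```python
def lexical_diversity(tokens):
    """Fraction of distinct tokens in a generation.
    Catastrophic repetition loops push this ratio toward 0."""
    return len(set(tokens)) / max(len(tokens), 1)

def flag_degenerate(tokens, threshold=0.1):
    """Flag an output for halting if its lexical diversity is anomalously low."""
    return lexical_diversity(tokens) < threshold
```

In deployment this check would run on the token stream of each generation (after an initial warm-up window, since very short outputs naturally have high ratios) and trigger early stopping when it fires.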
© 2026 Promptfoo. All rights reserved.