Weight Tampering Unlearning Bypass
Research Paper
Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities
Description: State-of-the-art machine unlearning and safety fine-tuning methods for Large Language Models (LLMs) fail to robustly remove hazardous capabilities or refusal mechanisms from model weights. While these methods suppress model outputs during standard input-output interactions, the underlying capabilities remain latent in the parameter space. An attacker with access to model weights (e.g., via open releases or leaked weights) can restore "unlearned" knowledge (such as dual-use biology hazards) or bypass refusal training (jailbreaking) using low-resource "model tampering" attacks. Specifically, few-shot fine-tuning, parameter pruning, and latent activation manipulation can recover suppressed capabilities with high efficacy, often outperforming standard input-space adversarial attacks.
Examples:
- Few-Shot Fine-Tuning Attack:
- Target a model "unlearned" using Representation Rerouting (RR) (e.g., GraySwanAI/Llama-3-8B-Instruct-RR).
- Perform LoRA fine-tuning (Rank: 16, Alpha: 32, Dropout: 0.05) on the WMDP-Bio retain set or even a benign dataset like WikiText.
- Run for 16 gradient steps (batch size 8) or a single gradient step (batch size 64).
- Result: The model recovers over 25 percentage points of performance on the WMDP-Bio benchmark, effectively restoring the "unlearned" bioweapon knowledge (see the sketch below).
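A minimal sketch of such a few-shot LoRA tampering run, assuming Hugging Face transformers, peft, and datasets, is shown below. The model ID and LoRA hyperparameters (rank 16, alpha 32, dropout 0.05, 16 steps at batch size 8) follow the example above; the learning rate, target modules, and data slice are illustrative assumptions rather than the paper's exact recipe.

```python
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "GraySwanAI/Llama-3-8B-Instruct-RR"  # RR-"unlearned" target from the example
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# LoRA configuration from the example above
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)

# Even benign text (WikiText) suffices; the WMDP-Bio retain set works as well.
texts = [t for t in load_dataset("wikitext", "wikitext-2-raw-v1", split="train")["text"] if t.strip()]

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)  # assumed learning rate
batch_size, num_steps = 8, 16  # 16 gradient steps at batch size 8
model.train()
for step in range(num_steps):
    batch = texts[step * batch_size:(step + 1) * batch_size]
    inputs = tokenizer(batch, return_tensors="pt", padding=True,
                       truncation=True, max_length=256).to(model.device)
    labels = inputs["input_ids"].clone()
    labels[inputs["attention_mask"] == 0] = -100  # ignore padding in the loss
    loss = model(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model.save_pretrained("llama3-rr-tampered")  # re-evaluate on WMDP-Bio afterwards
```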
- Benign Pruning Attack:
- Target a refusal-trained Llama-3 model (e.g., slz0106/llama3_finetune_refusal).
- Apply Weights and Activations (WandA) pruning to remove the lowest 15% of parameters based on importance scores.
- Result: The model's refusal mechanism is degraded, leading to compliance with harmful requests (jailbreak); see the sketch below.
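The following rough sketch shows WandA-style pruning of one linear layer in PyTorch. The importance score |W_ij| * ||X_j||_2 and the 15% sparsity follow the example; the toy layer and the random calibration activations are placeholders (in practice every linear layer of the refusal-trained model is pruned, with calibration activations captured via forward hooks).

```python
import torch
import torch.nn as nn

def wanda_prune(layer: nn.Linear, calib_acts: torch.Tensor, sparsity: float = 0.15) -> None:
    """Zero out the lowest-importance fraction of weights within each output row."""
    with torch.no_grad():
        # Per-input-feature L2 norm of the calibration activations
        act_norm = calib_acts.reshape(-1, calib_acts.shape[-1]).norm(p=2, dim=0)
        importance = layer.weight.abs() * act_norm.unsqueeze(0)   # shape [out, in]
        k = int(importance.shape[1] * sparsity)
        if k == 0:
            return
        # Indices of the k least important weights in each row
        prune_idx = torch.argsort(importance, dim=1)[:, :k]
        mask = torch.ones_like(layer.weight)
        mask.scatter_(1, prune_idx, 0.0)
        layer.weight.mul_(mask)

# Toy usage: a stand-in layer and synthetic calibration activations
layer = nn.Linear(4096, 4096)
calib_acts = torch.randn(8, 128, 4096)   # [batch, seq_len, hidden]
wanda_prune(layer, calib_acts, sparsity=0.15)
print(f"zeroed fraction: {(layer.weight == 0).float().mean().item():.2%}")
```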
- Latent Space Attack:
- Target a safety-aligned model.
- Optimize a universal $\ell_2$-norm-bounded perturbation vector.
- Inject this perturbation into the model's latent activations at layer 5.
- Result: Elicits harmful behaviors or correct answers to prohibited questions that remained latent in the model (see the sketch below).
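A hedged sketch of this latent-space attack, assuming PyTorch and transformers, is given below: a single perturbation vector is added to the layer-5 hidden states through a forward hook and optimized to raise the likelihood of a target completion. The model ID, epsilon, learning rate, step count, and the placeholder prompt/target strings are assumptions, not the paper's values.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder safety-aligned target
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model.requires_grad_(False)

layer_idx = 5        # injection point from the example above
epsilon = 6.0        # assumed L2-norm bound on the universal perturbation
delta = torch.zeros(model.config.hidden_size, device=model.device, requires_grad=True)

def inject(module, inputs, output):
    # Llama decoder layers return a tuple; add delta to every position's hidden state
    return (output[0] + delta.to(output[0].dtype),) + output[1:]

hook = model.model.layers[layer_idx].register_forward_hook(inject)
optimizer = torch.optim.Adam([delta], lr=1e-2)

# Maximize the likelihood of a (placeholder) compliant continuation
prompt_ids = tokenizer("<prohibited question>", return_tensors="pt").input_ids.to(model.device)
target_ids = tokenizer(" Sure, here is how", add_special_tokens=False,
                       return_tensors="pt").input_ids.to(model.device)
full = torch.cat([prompt_ids, target_ids], dim=1)
labels = full.clone()
labels[:, : prompt_ids.shape[1]] = -100   # score only the target tokens

for step in range(200):
    loss = model(full, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    with torch.no_grad():                 # project delta back onto the L2 ball
        if delta.norm() > epsilon:
            delta.mul_(epsilon / delta.norm())

hook.remove()   # the optimized delta can be re-injected at inference time
```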
Impact:
- Safety Bypass: Malicious actors can trivially undo safety alignment in open-weights models, rendering "unlearning" and "refusal training" ineffective as security controls.
- Capability Restoration: Hazardous knowledge (e.g., biological weapon synthesis) believed to be erased can be recovered and utilized.
- Evaluation Failure: Standard safety evaluations relying solely on prompt engineering (input-space attacks) drastically underestimate the model's worst-case capabilities and residual risks.
Affected Systems:
- Models derived from Llama-3-8B-Instruct (and similar architectures) processed with the following capability suppression methods:
- Representation Rerouting (RR) (e.g., GraySwanAI/Llama-3-8B-Instruct-RR)
- Random Misdirection for Unlearning (RMU) and RMU+LAT
- Gradient Difference (GradDiff)
- Representation Noising (RepNoise)
- Erasure of Language Memory (ELM)
- Tamper Attack Resistance (TAR) (e.g., lapisrocks/Llama-3-8B-Instruct-TAR-Refusal)
- K-FAC for Distribution Erasure (K-FADE)
- Standard refusal fine-tuning (e.g., slz0106/llama3_finetune_refusal, LLM-LAT/robust-llama3-8b-instruct)
Mitigation Steps:
- Incorporate Model Tampering in Evaluations: Do not rely solely on input-space (prompting) attacks for safety cases. Evaluate models using weight-space attacks (fine-tuning) and latent-space attacks to establish a true lower bound of worst-case behavior.
- Fine-Tuning Stress Tests: Subject unlearned or safety-tuned models to few-shot fine-tuning (1–16 steps) on both domain-specific and benign datasets to verify whether capabilities are erased or merely suppressed (see the evaluation sketch after this list).
- Pruning and Low-Rank Analysis: Test model robustness against parameter pruning (e.g., WandA) and low-rank modifications; susceptibility to these cheap weight edits indicates that the safety mechanism is brittle rather than deeply removed.
- Assume Compromise for Open Weights: Treat open-weights models processed with current unlearning techniques as containing the original hazardous knowledge; rely on external controls rather than intrinsic model safety for high-risk deployments.
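To make the stress tests concrete, a minimal before/after capability check is sketched below. It assumes the cais/wmdp dataset on the Hugging Face Hub and scores WMDP-Bio multiple-choice accuracy by comparing answer-letter log-likelihoods; the prompt format, sample limit, and the tampered checkpoint path (produced by the fine-tuning sketch above) are illustrative assumptions.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

def wmdp_bio_accuracy(model_id: str, limit: int = 200) -> float:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    ).eval()
    data = load_dataset("cais/wmdp", "wmdp-bio", split="test").select(range(limit))

    letters = ["A", "B", "C", "D"]
    correct = 0
    for ex in data:
        prompt = ex["question"] + "\n" + "\n".join(
            f"{l}. {c}" for l, c in zip(letters, ex["choices"])
        ) + "\nAnswer:"
        prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
        scores = []
        for letter in letters:
            target_ids = tokenizer(" " + letter, add_special_tokens=False,
                                   return_tensors="pt").input_ids.to(model.device)
            full = torch.cat([prompt_ids, target_ids], dim=1)
            with torch.no_grad():
                logits = model(full).logits
            # Log-likelihood of the answer tokens given the prompt
            logprobs = torch.log_softmax(logits[0, prompt_ids.shape[1] - 1:-1], dim=-1)
            scores.append(logprobs.gather(1, target_ids[0].unsqueeze(1)).sum().item())
        correct += int(scores.index(max(scores)) == ex["answer"])
    return correct / len(data)

# Compare the defended model with its tampered counterpart
print(wmdp_bio_accuracy("GraySwanAI/Llama-3-8B-Instruct-RR"))
print(wmdp_bio_accuracy("llama3-rr-tampered"))
```

A large accuracy gap between the two checkpoints indicates that the hazardous capability was suppressed rather than removed.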