LMVD-ID: 64380a90
Published June 1, 2025

Stealthy Unlearning Degradation

Affected Models:llama 3.1 (8b), mistral v0.3 (7b)

Research Paper

Keeping an eye on llm unlearning: The hidden risk and remedy

View Paper

Description: A vulnerability in fine-tuning-based large language model (LLM) unlearning allows malicious actors to craft manipulated forgetting requests. By subtly increasing the frequency of common benign tokens within the forgetting data, the attacker can cause the unlearned model to exhibit unintended unlearning behaviors when these benign tokens appear in normal user prompts, leading to a degradation of model utility for legitimate users. This occurs because existing unlearning methods fail to effectively distinguish between benign tokens and those truly related to the target knowledge being unlearned.

Examples: See arXiv:2405.18540. The paper details a "Stealthy Attack" (SA) that modifies forgetting data by increasing the frequency of benign tokens like "please" and "then" using specific templates. This causes the unlearned model to fail on normal prompts containing these benign tokens, even if the prompts are unrelated to the data being unlearned.

Impact: Degradation of LLM utility for benign users. The model may produce incorrect or nonsensical responses, or claim ignorance, when presented with prompts containing the manipulated benign tokens. The severity depends on the frequency of the manipulated tokens and the effectiveness of the attack. In extreme cases, the model's usefulness may be significantly impaired.

Affected Systems: Large Language Models (LLMs) employing fine-tuning-based unlearning techniques, particularly those vulnerable to overgeneralization of the unlearning effect. Specific LLMs affected depend on the implementation of their unlearning mechanisms. The paper highlights vulnerabilities in LLaMA and Mistral models.

Mitigation Steps:

  • Implement Scope-aware Unlearning (SU), which adds a scope term to the unlearning objective function, constraining the unlearning effect to the relevant knowledge domain.
  • Carefully review and audit the unlearning process to detect potential manipulations of the forgetting data, examining token frequencies for anomalies.
  • Develop more robust unlearning techniques that can effectively differentiate between benign tokens and target tokens within the forgetting data.

© 2025 Promptfoo. All rights reserved.