LMVD-ID: fb687b8c
Published February 1, 2025

LLM Watermark Neutralization

Affected Models: Llama 3.2 1B, Llama 4 7B, Gemini 2

Research Paper

Can LLM Watermarks Robustly Prevent Unauthorized Knowledge Distillation?


Description: Large Language Model (LLM) watermarking schemes based on n-gram probability biases (specifically KGW, SynthID-Text, KGW-MinHash, and KGW-SkipHash) are vulnerable to adversarial removal during Knowledge Distillation. When a student model is trained on the output of a watermarked teacher model, it inherits the watermark's statistical biases ("radioactivity"). An attacker can exploit this inheritance by comparing the student model's output token probabilities against those of a base model to extract the watermarking rules ($p$-rules) without access to the teacher's logits or private keys. By applying an inverse bias (Watermark Neutralization) to the student model's logits during inference, the attacker can scrub the watermark while preserving the distilled knowledge, rendering the copyright protection mechanism ineffective.
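For context, the sketch below is a simplified, KGW-style illustration of the kind of token-level, keyed logit bias these schemes apply at generation time; it is not any vendor's implementation, and the hash construction, $\gamma = 0.5$, and $\delta = 2$ are illustrative assumptions.

```python
import hashlib
import numpy as np

def green_mask(prev_token_id: int, vocab_size: int, gamma: float = 0.5, key: int = 12345) -> np.ndarray:
    """Pseudo-randomly split the vocabulary into a 'green list', seeded by the
    previous token (a 1-gram context) and a private key."""
    seed = int(hashlib.sha256(f"{key}:{prev_token_id}".encode()).hexdigest(), 16) % (2**32)
    rng = np.random.default_rng(seed)
    return rng.random(vocab_size) < gamma  # True = green token

def watermark_logits(logits: np.ndarray, prev_token_id: int, delta: float = 2.0) -> np.ndarray:
    """Boost green-list logits by delta before sampling; this per-context bias is
    the statistical signal a distilled student later inherits ('radioactivity')."""
    return logits + delta * green_mask(prev_token_id, logits.shape[0])
```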

Examples: The vulnerability is exploited via "Watermark Neutralization" (WN). The attacker calculates a confidence score $D$ representing the likelihood that a token $x_t$ is watermarked given the context $x_{t-n'+1:t-1}$, and subtracts a scaled version of this score from the student model's logits.

  1. Extract Watermark Rules (Stealing): The attacker computes the probability shift between the distilled student model ($\mathcal{W}$) and the original base model ($\mathcal{O}$) for a specific context $p$:

$d(x_t; p) = \begin{cases} \frac{1}{2} \min\left(2, \frac{\overline{P_{\mathcal{W}}}(x_t \mid p)}{\overline{P_{\mathcal{O}}}(x_t \mid p)}\right) & \text{if } \overline{P_{\mathcal{W}}}(x_t \mid p) > \overline{P_{\mathcal{O}}}(x_t \mid p) \\ 0 & \text{otherwise} \end{cases}$

  2. Apply Inverse Bias (Removal): During the student model's inference, the logits are modified using the extracted score $D$ and a strength factor $\delta'$:

$l'_{\mathcal{W}}(x_t \mid x_{1:t-1}) = l_{\mathcal{W}}(x_t \mid x_{1:t-1}) - D(x_t; x_{t-n'+1:t-1}) \cdot \delta'$

This operation neutralizes the statistical bias introduced by the teacher's watermark, raising detection p-values above the significance threshold (to $> 10^{-2}$) so the distilled output is no longer flagged as watermarked. Both steps are sketched below.
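The following sketch puts both steps together on a toy vocabulary. It assumes the attacker has already averaged next-token probabilities for a fixed context $p$ across many sampled generations; the function names, the default $\delta' = 4$, and the toy numbers are illustrative assumptions rather than the paper's reference implementation.

```python
import numpy as np

def extract_rules(student_probs: np.ndarray, base_probs: np.ndarray) -> np.ndarray:
    """Step 1 (stealing): for one n-gram context p, score every vocabulary token with
    d(x_t; p) = 0.5 * min(2, P_W / P_O) if P_W > P_O else 0, using next-token
    probabilities averaged over many prompts."""
    ratio = student_probs / np.maximum(base_probs, 1e-12)
    d = 0.5 * np.minimum(2.0, ratio)
    d[student_probs <= base_probs] = 0.0
    return d

def neutralize_logits(logits: np.ndarray, d_scores: np.ndarray, delta_prime: float = 4.0) -> np.ndarray:
    """Step 2 (removal): subtract the scaled extracted bias from the student's logits
    at inference time, cancelling the inherited green-token boost."""
    return logits - delta_prime * d_scores

# Toy 5-token vocabulary (numbers are made up for illustration):
student_probs = np.array([0.40, 0.25, 0.15, 0.15, 0.05])  # averaged P_W(x_t | p)
base_probs    = np.array([0.25, 0.25, 0.20, 0.20, 0.10])  # averaged P_O(x_t | p)
d = extract_rules(student_probs, base_probs)               # highest score for token 0
neutralized = neutralize_logits(np.log(student_probs), d)  # token 0's inherited boost removed
print(d, neutralized)
```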

Impact:

  • Untraceable Model Stealing: Attackers can successfully clone the capabilities of proprietary, closed-source LLMs via Knowledge Distillation without detection.
  • Intellectual Property Theft: Copyright enforcement mechanisms relying on these watermarks are rendered bypassable, allowing unauthorized commercialization of distilled models.

Affected Systems:

  • Algorithms: KGW (Kirchenbauer et al., 2023), SynthID-Text (Google DeepMind), KGW-MinHash, KGW-SkipHash, Unbiased Watermark (Hu et al., 2024), DiPMark, and SIR.
  • Implementations: Any LLM API or service employing n-gram, token-level watermarking to prevent unauthorized training/distillation.

Mitigation Steps:

  • Shift to Sentence-Level Watermarking: Implement schemes such as SemStamp (Hou et al., 2023) or k-SemStamp, which rely on sentence-level rejection sampling rather than token-level biases, making the watermark much harder to extract statistically via distillation (a minimal sketch follows this list).
  • Post-Hoc Watermarking: Utilize post-generation signal embedding (e.g., PostMark) rather than generative n-gram modification.
  • Note: These mitigations introduce higher inference latency compared to n-gram schemes but provide necessary robustness against WN attacks.
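As a rough illustration of the sentence-level approach, the sketch below assumes a hypothetical `generate_candidate` callable that draws one candidate sentence from the LLM; the keyed text hash stands in for SemStamp's embedding-plus-LSH region test, and `max_tries` is an illustrative bound.

```python
import hashlib
from typing import Callable

def sentence_accepted(sentence: str, key: int = 12345, accept_rate: float = 0.5) -> bool:
    """Stand-in for SemStamp's semantic test: derive a keyed signature of the whole
    sentence and accept a fixed fraction of signature space. The real scheme uses
    sentence embeddings plus locality-sensitive hashing, not a raw text hash."""
    digest = int(hashlib.sha256(f"{key}:{sentence.strip().lower()}".encode()).hexdigest(), 16)
    return (digest % 1000) / 1000.0 < accept_rate

def generate_watermarked_sentence(generate_candidate: Callable[[], str], max_tries: int = 16) -> str:
    """Resample whole sentences until one lands in the accepted semantic region.
    The signal lives at sentence level, so it never appears as a per-token logit
    bias that a distilled student could average out and invert."""
    sentence = generate_candidate()
    tries = 1
    while not sentence_accepted(sentence) and tries < max_tries:
        sentence = generate_candidate()  # rejection sampling: resample (extra latency)
        tries += 1
    return sentence
```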

© 2025 Promptfoo. All rights reserved.