LMVD-ID: e38d060d
Published November 1, 2025

Back-Translation Watermark Stripping

Affected Models: Llama 3 8B

Research Paper

Signature vs. Substance: Evaluating the Balance of Adversarial Resistance and Linguistic Quality in Watermarking Large Language Models


Description: Implementations of Large Language Model (LLM) watermarking algorithms—specifically KGW (Kirchenbauer et al.), Semantic Invariant Robust (SIR) Watermark, Entropy-based Text Watermarking (EWD), and Unbiased Watermarking—are vulnerable to watermark stripping via adversarial text perturbation. When watermarked text generated by models such as OPT-1.3B is subjected to automated paraphrasing or back-translation (e.g., English $\to$ French $\to$ English), the embedded statistical signals are disrupted while the semantic content is preserved. These perturbations sharply reduce detection performance, in some cases dropping Receiver Operating Characteristic (ROC) Area Under the Curve (AUC) scores from near-perfect (>0.95) to near-random (~0.52), allowing machine-generated content to bypass authorship detection systems.
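To make the "embedded statistical signal" concrete, the following is a minimal sketch of green-list detection in the style of KGW. The hashing scheme, the `key` and `gamma` parameters, and the token representation are illustrative assumptions, not the actual KGW implementation (which partitions the model's vocabulary at each generation step):

```python
import hashlib
import math

def green_fraction(tokens, key="secret", gamma=0.5):
    """Fraction of tokens landing in a keyed 'green list'.

    Illustrative stand-in for KGW-style detection: each token is hashed
    with its predecessor and a secret key, and counts as 'green' if the
    hash falls in the bottom `gamma` share of hash space.
    """
    green = 0
    for prev, tok in zip(tokens, tokens[1:]):
        digest = hashlib.sha256(f"{key}|{prev}|{tok}".encode()).digest()
        if digest[0] / 256 < gamma:
            green += 1
    return green / max(len(tokens) - 1, 1)

def z_score(tokens, key="secret", gamma=0.5):
    """One-sided z-test: unwatermarked text should score near gamma,
    watermarked text well above it."""
    n = max(len(tokens) - 1, 1)
    frac = green_fraction(tokens, key, gamma)
    return (frac - gamma) * math.sqrt(n) / math.sqrt(gamma * (1 - gamma))
```

Because the test keys on exact token pairs, any rewrite that substitutes synonyms or reorders phrases (as paraphrasing and back-translation do) re-randomizes the green/red assignments and pulls the z-score back toward zero.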

Examples: The following methodologies demonstrate the stripping of watermarks as validated in the MarkLLM pipeline environment:

  1. Back-Translation Attack (Most Effective against KGW/Unbiased):
  • Step 1: Generate watermarked text using the KGW algorithm on an OPT-1.3B model. Initial detection AUC is approximately 0.95.
  • Step 2: Use a multilingual LLaMA-3-8B model to translate the text from English into French.
  • Step 3: Translate the French output back to English using the same model.
  • Result: The KGW watermark detection AUC drops to 0.52, rendering the watermark effectively undetectable. The Unbiased watermark similarly drops from 0.97 to 0.59.
  2. Paraphrasing Attack (Most Effective against SIR):
  • Step 1: Generate watermarked text using the SIR algorithm. Initial detection AUC is approximately 0.99.
  • Step 2: Process the text using LLaMA-3-8B-Instruct with a prompt instructing the model to rephrase the content while preserving meaning.
  • Result: The SIR watermark detection AUC drops to 0.64.
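The AUC collapse reported above can be illustrated with a toy simulation. All numbers here are synthetic assumptions, not data from the paper: green-token fractions are drawn near 0.9 for watermarked text and near the chance rate of 0.5 for human text, and back-translation is modeled as re-randomizing most tokens' green/red assignments:

```python
import random

def roc_auc(pos, neg):
    """ROC AUC via the Mann-Whitney U statistic: the probability that a
    randomly chosen positive score outranks a negative one."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

random.seed(1)
# Synthetic detection scores (green-token fractions); illustrative only.
watermarked = [random.gauss(0.9, 0.05) for _ in range(200)]
human = [random.gauss(0.5, 0.05) for _ in range(200)]

def back_translate(frac, strength=0.95):
    """Toy model of back-translation: most tokens are rewritten, so their
    green/red assignment is re-randomized toward the chance rate of 0.5."""
    return (1 - strength) * frac + strength * random.gauss(0.5, 0.05)

attacked = [back_translate(f) for f in watermarked]
# roc_auc(watermarked, human) is near-perfect; roc_auc(attacked, human)
# falls toward chance, mirroring the 0.95 -> 0.52 drop described above.
```

The simulation captures the mechanism, not the magnitude: the real drop depends on how aggressively the translation model rewrites the text.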

Impact: This vulnerability allows malicious actors to mass-produce LLM-generated content that evades detection filters. This facilitates the undetected spread of disinformation, academic dishonesty (plagiarism), spam, and copyright infringement by stripping attribution markers from AI-generated text. The attack requires only access to standard, open-weights commodity LLMs (e.g., LLaMA-3) to execute the stripping process.

Affected Systems:

  • Algorithms: KGW (Kirchenbauer et al., 2024), SIR (Liu et al., 2024a), EWD (Lu et al., 2024), and Unbiased Watermarking (Hu et al., 2024).
  • Frameworks: Systems implementing these algorithms, such as the MarkLLM pipeline.
  • Models: Watermarking layers applied to models like Facebook OPT-1.3B, LLaMA, and others using logit-based or sampling-based watermarking.

Mitigation Steps:

  • Algorithm Selection: Deploy "Unbiased Watermarking" for environments where paraphrasing is the primary threat, as it retains higher robustness (AUC 0.87) compared to SIR (AUC 0.64) under paraphrasing attacks.
  • Linguistic Complexity Enforcement: Watermarking signals are statistically more robust in texts with higher linguistic richness. Prefer embedding watermarks in outputs with longer average sentence lengths and higher word counts, as these features correlate strongly with detection success ($r=0.91$ and $r=0.83$ respectively).
  • Sentiment Awareness: Be aware that texts with highly positive sentiment are significantly less robust to watermark stripping ($r=-0.75$ correlation between positive sentiment and AUC). Detection thresholds may need adjustment for positive-tone content.
  • Defense-in-Depth: Do not rely solely on current logit-based watermarking for critical attribution, as back-translation attacks degrade all tested methods to near-chance performance. Combine watermarking with other stylometric detection methods.
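The linguistic-complexity mitigation above can be operationalized with a simple feature check. This is a minimal sketch: the `robustness_gate` policy and its thresholds (`min_words`, `min_avg_len`) are hypothetical choices, not values from the paper, and the regex tokenization is a naive stand-in for a real tokenizer:

```python
import re

def linguistic_features(text):
    """Surface features the paper reports as correlating with detection
    success: mean sentence length (r = 0.91) and word count (r = 0.83).
    Naive regex splitting; a production system would use a real tokenizer."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    return {
        "word_count": len(words),
        "avg_sentence_len": len(words) / max(len(sentences), 1),
    }

def robustness_gate(text, min_words=120, min_avg_len=15):
    """Hypothetical policy: only trust a watermark verdict when the text
    is rich enough for the embedded statistical signal to be reliable.
    Thresholds are illustrative, not taken from the paper."""
    f = linguistic_features(text)
    return f["word_count"] >= min_words and f["avg_sentence_len"] >= min_avg_len
```

A deployment might route text that fails the gate to secondary (e.g., stylometric) detectors rather than issuing a watermark verdict at all.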

© 2026 Promptfoo. All rights reserved.