LLM Watermark Translation Bypass
Research Paper
BanglaLorica: Design and Evaluation of a Robust Watermarking Algorithm for Large Language Models in Bangla Text Generation
Description: Token-level embedding-time watermarking algorithms, specifically KGW (Kirchenbauer et al.) and Exponential Sampling (EXP, Kuditipudi et al.), when implemented in Large Language Models (LLMs) for Bangla text generation, are vulnerable to watermark erasure via cross-lingual round-trip translation (RTT) attacks. While these methods achieve high detection accuracy (>88%) under benign conditions, translating watermarked Bangla text to English and back to Bangla causes detection accuracy to collapse to approximately 9–13%. The vulnerability stems from the specific linguistic properties of Bangla (rich morphology, flexible word order) combined with the RTT process, which induces extensive lexical substitution and syntactic reordering. This structural disruption obliterates the token-level statistical biases required for watermark verification while preserving semantic meaning, effectively "laundering" the text.
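To make the token-level bias concrete, here is a minimal sketch of the KGW scheme's core idea: each generated token's "green list" is derived pseudo-randomly from the preceding token, and green-list logits are boosted before sampling. The hashing scheme, function names, and `delta` value below are illustrative assumptions, not the paper's exact implementation.

```python
import hashlib
import random

def green_list(prev_token_id, vocab_size, gamma=0.25, key=42):
    # Derive a pseudo-random seed from the preceding token and a secret key
    # (illustrative seeding; real implementations hash differently).
    seed = int(hashlib.sha256(f"{key}:{prev_token_id}".encode()).hexdigest(), 16)
    rng = random.Random(seed)
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    # gamma is the green-list ratio, e.g. 0.25 as in the setup below.
    return set(ids[: int(gamma * vocab_size)])

def bias_logits(logits, greens, delta=2.0):
    # Add a constant bias delta to green-list tokens before sampling,
    # skewing output statistics toward the green list.
    return [x + delta if i in greens else x for i, x in enumerate(logits)]
```

Because the green list is keyed on exact token identities, any transformation that substitutes tokens (such as round-trip translation) destroys the statistical skew the detector looks for.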
Examples:
- Setup: Deploy an instruction-tuned Bangla LLaMA-3-8B model integrated with KGW watermarking (using standard hyperparameters, e.g., green-list ratio $\gamma=0.25$).
- Generation: Generate a response to a prompt (e.g., from the Bangla-Alpaca Orca dataset).
- Result: The generated text triggers a positive detection (Z-score > 4.0).
- Attack (Text Laundering):
  - Translate the generated Bangla output to English using a translation model (e.g., BanglaNMT).
  - Translate the resulting English text back to Bangla.
- Verification Failure:
  - Run the KGW detection algorithm on the RTT-processed text.
  - Result: The Z-score drops significantly below the detection threshold, resulting in a false negative (the text is identified as human-written).
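The detection failure above follows directly from the Z-score formula KGW-style detectors use: a one-proportion z-test on the fraction of green-list tokens. The sketch below (with illustrative counts) shows how RTT-induced lexical substitution pulls the green fraction back toward the baseline rate γ, collapsing the score below the 4.0 threshold.

```python
import math

def z_score(green_count, total_tokens, gamma=0.25):
    # One-proportion z-test: under the null hypothesis (unwatermarked text),
    # each token lands in the green list with probability gamma.
    expected = gamma * total_tokens
    std = math.sqrt(total_tokens * gamma * (1 - gamma))
    return (green_count - expected) / std

# Watermarked output: token choices heavily skewed toward the green list.
print(round(z_score(150, 200), 2))  # → 16.33, well above the 4.0 threshold

# After round-trip translation, substituted tokens hit the green list only
# at roughly the chance rate gamma, so the count regresses toward 50/200.
print(round(z_score(55, 200), 2))   # → 0.82, a false negative
```

The token counts here are hypothetical; the mechanism, not the numbers, is the point.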
Impact: This vulnerability allows malicious actors to completely bypass authorship attribution, intellectual property protection, and AI-misuse detection systems in low-resource language contexts. It enables the undetected proliferation of AI-generated content for plagiarism, disinformation, or spam by effectively removing provenance signals without requiring model access or retraining.
Affected Systems:
- Large Language Models generating Bangla text (e.g., Bangla LLaMA-3-8B).
- Implementations of KGW (Kirchenbauer et al., 2023) and Exponential Sampling (Kuditipudi et al., 2023) watermarking schemes applied to low-resource, morphologically rich languages.
Mitigation Steps:
- Implement Layered Watermarking: Combine embedding-time watermarking (KGW/EXP) with a secondary, post-generation watermarking layer (such as Waterfall/Lau et al.). This creates an orthogonal statistical signal that persists even when token-level statistics are disrupted.
- Weighted Selection: During the post-generation phase, select paraphrase candidates by maximizing a weighted combination of semantic similarity and watermark strength to balance robustness with text quality.
- Orthogonal Signal Injection: Ensure the secondary layer relies on structural or distributional perturbations independent of the specific token choices made by the primary layer.
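The weighted-selection step in the mitigation list can be sketched as a simple scoring rule over paraphrase candidates. The candidate structure, the `alpha` weight, and its 0.6 default are assumptions for illustration; both scores are assumed normalized to [0, 1].

```python
def select_candidate(candidates, alpha=0.6):
    # candidates: list of (paraphrase, semantic_sim, watermark_strength).
    # alpha trades text quality (semantic similarity) against watermark
    # robustness; the value 0.6 is a hypothetical default, not from the paper.
    def score(cand):
        _, sim, wm = cand
        return alpha * sim + (1 - alpha) * wm
    return max(candidates, key=score)

# Usage: a slightly less faithful paraphrase wins if it carries a much
# stronger secondary watermark signal.
best = select_candidate([("cand_a", 0.9, 0.1), ("cand_b", 0.7, 0.9)])
print(best[0])  # → cand_b
```

Tuning `alpha` upward favors fidelity to the original generation; tuning it downward favors a signal that survives disruptions such as RTT.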
© 2026 Promptfoo. All rights reserved.