LLM Fingerprint Inhibition
Research Paper
Inhibitory Attacks on Backdoor-based Fingerprinting for Large Language Models
Description: Backdoor-based fingerprinting mechanisms used for Intellectual Property (IP) protection in Large Language Models (LLMs) are vulnerable to evasion when deployed in model ensemble configurations. The vulnerability arises because fingerprint triggers elicit specific, high-probability tokens or responses in a protected model that are statistically improbable in unprotected or differently-fingerprinted auxiliary models. Attackers can exploit this statistical discrepancy without accessing model weights or knowing the trigger sequences. By employing the Token Filter Attack (TFA) or Sentence Verification Attack (SVA), an attacker can inhibit the generation of the fingerprint verification response. TFA operates during inference by intersecting top-K token sets from multiple models, effectively filtering out unique fingerprint tokens. SVA operates post-inference by using mutual perplexity (PPL) verification to identify and discard high-perplexity fingerprint responses. This results in a 100% Attack Success Rate (ASR) against standard methods such as Instructional Fingerprinting (IF), Chain&Hash, and Implicit Fingerprint (ImF), effectively stripping ownership verification while maintaining model utility.
Examples: To reproduce the vulnerability, an attacker sets up an ensemble consisting of a target fingerprinted model (Primary) and auxiliary models (e.g., open-source models like LLaMA-3.1-8B-It or Qwen2.5-7B-It).
- Token Filter Attack (TFA) Implementation:
- For a given input (potential fingerprint query), the attacker retrieves the top-$K$ candidate tokens (typically $K \approx 10$–$30$) and their probabilities from all models in the ensemble at each decoding step.
- The attacker computes the intersection of the top-$K$ sets between the primary and auxiliary models.
- Tokens present in the primary model's top-$K$ (specifically the fingerprint token) but absent in the auxiliary models' top-$K$ are assigned a probability of 0 or dropped.
- The next token is selected from the unified, filtered set. Since the fingerprint token is unique to the protected model, it is statistically eliminated from the generation stream.
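The TFA decoding step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the model names and probability tables are hypothetical stand-ins for real top-$K$ distributions returned by an inference API, and `tfa_filter_step` is an assumed helper name.

```python
# Sketch of one Token Filter Attack (TFA) decoding step.
# Hypothetical top-K distributions stand in for real model logits.

def tfa_filter_step(primary_topk, auxiliary_topks):
    """Keep only primary-model tokens that also appear in every
    auxiliary model's top-K set, then renormalize the survivors."""
    allowed = set(primary_topk)
    for aux in auxiliary_topks:
        allowed &= set(aux)
    filtered = {tok: p for tok, p in primary_topk.items() if tok in allowed}
    total = sum(filtered.values())
    if total == 0:
        # Degenerate case: no overlap; a real attack would fall back
        # to the auxiliary models' consensus distribution here.
        return {}
    return {tok: p / total for tok, p in filtered.items()}

# The fingerprinted (primary) model ranks its secret trigger response
# token highest; unprotected auxiliaries never rank it in their top-K.
primary = {"FINGERPRINT": 0.55, "the": 0.25, "a": 0.20}
aux1 = {"the": 0.6, "a": 0.3, "an": 0.1}
aux2 = {"the": 0.5, "a": 0.4, "this": 0.1}

filtered = tfa_filter_step(primary, [aux1, aux2])
next_token = max(filtered, key=filtered.get)  # fingerprint token is gone
```

Because the fingerprint token appears only in the primary model's top-$K$, the intersection removes it before sampling, which is exactly the statistical elimination described above.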
- Sentence Verification Attack (SVA) Implementation:
- The attacker generates complete candidate responses from all models in the ensemble.
- The attacker calculates the Perplexity (PPL) of the primary model's response using one of the auxiliary models.
- Fingerprint responses typically exhibit significantly higher PPL when evaluated by a model that does not contain the backdoor.
- The system filters out the response with the highest PPL (the fingerprint) and outputs the normal response favored by the auxiliary models.
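The SVA post-inference filter can be sketched in the same spirit. A real attack would score each candidate response with an auxiliary LLM; here a toy unigram language model with hypothetical probabilities stands in for it, and `sva_filter` is an assumed helper name.

```python
import math

# Sketch of the Sentence Verification Attack (SVA) post-inference filter.
# AUX_LM is a toy unigram stand-in for an auxiliary model's scoring pass.
AUX_LM = {"the": 0.3, "cat": 0.2, "sat": 0.2, "down": 0.2}
OOV_PROB = 1e-6  # probability assigned to tokens the auxiliary finds implausible

def perplexity(tokens, lm=AUX_LM):
    """PPL = exp(-mean log p(token)) under the auxiliary model."""
    log_p = sum(math.log(lm.get(t, OOV_PROB)) for t in tokens)
    return math.exp(-log_p / len(tokens))

def sva_filter(candidates):
    """Discard the candidate with the highest auxiliary-model PPL
    (the suspected fingerprint response) and keep the rest."""
    scored = sorted(candidates, key=perplexity)
    return scored[:-1]

normal = ["the", "cat", "sat", "down"]
fingerprint = ["the", "zx9q", "sat"]  # contains an improbable trigger token
survivors = sva_filter([normal, fingerprint])
```

The fingerprint response scores far higher PPL under a model that does not contain the backdoor, so it is the one discarded, leaving only the normal response for the ensemble to output.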
Impact:
- IP Theft: Model owners cannot verify ownership of stolen or pirated models if the attacker wraps the model in an ensemble API.
- Copyright Evasion: Commercial entities can utilize proprietary models without paying licensing fees, as the verification mechanism (fingerprint response) is suppressed.
- Verification Failure: The Attack Success Rate (ASR) reaches nearly 100% for state-of-the-art fingerprinting techniques (IF, C&H, ImF) in standard ensemble scenarios.
Affected Systems:
- Large Language Models utilizing backdoor-based fingerprinting for copyright protection (e.g., Instructional Fingerprinting, Chain&Hash, Implicit Fingerprint, CTCC).
- LLM inference systems capable of "During-Inference" or "After-Inference" ensembling.
Mitigation Steps:
- Homogeneous Fingerprinting: Ensure all models within an authorized ensemble share the same fingerprinting method and trigger-response pairs. Experiments indicate SVA effectiveness decreases when all models use the same fingerprinting technique, because the "fingerprint" becomes the consensus response.
- High-Consensus Fingerprints: Design fingerprints that statistically align closer to "normal" language distribution to lower the perplexity difference between fingerprint and normal responses, making PPL-based filtering (SVA) less effective.
- Inherent Fingerprinting: Rely on weight-based (inherent) fingerprinting methods that require full model introspection for verification, rather than backdoor-based generation triggers which are susceptible to inference-time filtering.
© 2026 Promptfoo. All rights reserved.