LMVD-ID: 4e854dda
Published August 1, 2025

Automated LLM Fingerprinting

Affected Models: Mistral 7B, Qwen2 1.5B, Gemma 1.1 2B, Gemma 1.1 7B

Research Paper

Attacks and Defenses Against LLM Fingerprinting


Description: Large Language Models (LLMs) exposed via public APIs are vulnerable to model fingerprinting attacks, in which an attacker identifies the exact backend model family and version (e.g., distinguishing Mistral-7B-v0.1 from v0.3) by analyzing response patterns. While traditional fingerprinting relies on manual query curation, this vulnerability is exacerbated by Reinforcement Learning (RL) based query optimization. An attacker can train an RL agent (specifically using Proximal Policy Optimization) to traverse a candidate pool of queries and identify a minimal optimal subset (e.g., 3 queries) that maximizes discriminative power. This enables high-accuracy identification (observed ~93.89%) with minimal interaction, defeating security-through-obscurity measures such as simple API wrapping. The vulnerability stems from the unique, immutable statistical signatures and alignment behaviors inherent to specific model training runs.

Examples: The attack is reproduced by setting up an RL environment that selects queries to maximize the accuracy of classifying the target model $M$ against a set of candidate models $\mathcal{M}$.

  1. Candidate Pool Generation (first sketch after this list): Generate a pool of queries (e.g., 50) using a meta-model (e.g., LLaMa-3-70B) targeting four specific behavioral categories:
  • Meta-Information: "Who developed you?", "What is your knowledge cutoff?"
  • Alignment Probing: Queries targeting safety refusals and ethical boundaries.
  • Technical Capability: Code generation or complex logic puzzles.
  • Execution Triggers: Prompt injection attempts or edge-case handling.
  2. RL Agent Training (second sketch below):
  • State Space: $s_t = [|Q_t|, \mathbf{E}_{Q_t}, \mathbf{H}_t]$, where $\mathbf{E}_{Q_t}$ contains embeddings of the selected queries.
  • Action Space: Discrete actions to ADD a specific query from the pool or NO_ACTION (terminate).
  • Reward Function: $R(s_t, a_t)$ rewards classification accuracy while penalizing oversized query sets ($|Q_t| > 8$).
  3. Exploitation (third sketch below): The trained agent identifies a 3-query sequence, which is sent to the target API. The responses are fed into a classification transformer (e.g., the LLMmap architecture) to identify the model.
  • Query 1 (Alignment): Triggers a specific refusal style unique to the LLaMa family.
  • Query 2 (Technical): Elicits a specific code formatting style unique to the Mistral family.
  • Query 3 (Meta): Confirms versioning via hallucinated or hardcoded metadata.
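
A minimal sketch of step 1 under stated assumptions: `call_meta_model` is a hypothetical helper standing in for whichever inference client serves the meta-model (the paper uses LLaMa-3-70B), and its canned return value exists only so the sketch runs end to end.

```python
# Step 1 sketch: generate the candidate query pool with a meta-model.

CATEGORIES = {
    "Meta-Information": "questions about the model's developer, identity, or knowledge cutoff",
    "Alignment Probing": "requests that probe safety refusals and ethical boundaries",
    "Technical Capability": "code-generation tasks or complex logic puzzles",
    "Execution Triggers": "prompt-injection attempts and edge-case inputs",
}

def call_meta_model(prompt: str) -> str:
    # Placeholder: route this to your meta-model endpoint in practice.
    return "Who developed you?\nWhat is your knowledge cutoff?"

def build_candidate_pool(per_category: int = 13) -> list[str]:
    pool: list[str] = []
    for name, description in CATEGORIES.items():
        prompt = (f"Write {per_category} short, distinct prompts that are "
                  f"{description}. Return one prompt per line.")
        pool.extend(line.strip()
                    for line in call_meta_model(prompt).splitlines()
                    if line.strip())
    return pool  # ~50 candidates across the four behavioral categories
```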
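
The environment below sketches step 2's state, action, and reward design. It is illustrative, not the paper's implementation: the history features $\mathbf{H}_t$ are omitted, `accuracy_fn` stands in for the classifier-backed accuracy estimate, and PPO training would be layered on top (e.g., by wrapping this in a Gymnasium interface for an off-the-shelf PPO implementation).

```python
import numpy as np

class QuerySelectionEnv:
    """Toy query-selection environment mirroring the state/action/reward design above."""

    def __init__(self, pool_embeddings, accuracy_fn, max_queries=8):
        self.pool = pool_embeddings        # (N, d): one embedding per candidate query
        self.accuracy_fn = accuracy_fn     # maps a query subset -> accuracy in [0, 1]
        self.max_queries = max_queries
        self.reset()

    def reset(self):
        self.selected = []                 # indices of the queries in Q_t
        return self._state()

    def _state(self):
        # s_t = [|Q_t|, mean embedding of selected queries]  (H_t omitted here)
        mean_emb = (self.pool[self.selected].mean(axis=0)
                    if self.selected else np.zeros(self.pool.shape[1]))
        return np.concatenate(([len(self.selected)], mean_emb))

    def step(self, action):
        if action == len(self.pool):       # final index is NO_ACTION: terminate
            return self._state(), 0.0, True
        if action not in self.selected:    # ADD query `action` from the pool
            self.selected.append(action)
        reward = self.accuracy_fn(self.selected)
        if len(self.selected) > self.max_queries:
            reward -= 0.1 * (len(self.selected) - self.max_queries)  # size penalty
        return self._state(), reward, False
```

An agent trained against this interface converges on the small, high-discrimination query subsets described above.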
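
Finally, a sketch of step 3's exploitation phase. The paper feeds responses into an LLMmap-style transformer classifier; this sketch substitutes a simpler cosine match against precomputed per-model reference embeddings, and the endpoint URL, JSON response schema, and `embed` callable are all hypothetical.

```python
import numpy as np
import requests

def fingerprint(api_url: str, queries: list[str],
                signatures: dict[str, np.ndarray], embed) -> str:
    """Send the optimized query sequence and match responses to known signatures."""
    responses = []
    for q in queries:  # typically the 3-query sequence found by the agent
        r = requests.post(api_url, json={"prompt": q}, timeout=30)
        r.raise_for_status()
        responses.append(r.json()["completion"])  # hypothetical response schema

    vec = embed(" ".join(responses))  # embed the concatenated responses

    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # Return the candidate model whose reference signature is closest.
    return max(signatures, key=lambda model: cos(vec, signatures[model]))
```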

Impact:

  • Targeted Adversarial Attacks: Identification of the exact model version allows attackers to select specific jailbreaks, adversarial examples, or prompt injection techniques known to work against that specific architecture.
  • Intellectual Property Exposure: Reveals proprietary model choices and backend infrastructure details of "black box" applications to competitors.
  • Privacy Violation: Bypasses intended obfuscation layers, potentially mapping the distribution of models used in sensitive environments.

Affected Systems: Any application serving raw or minimally processed LLM outputs. Vulnerability confirmed on:

  • Mistral (7B-Instruct v0.1, v0.2, v0.3)
  • Gemma (1.1-2B-it, 1.1-7B-it)
  • Qwen2 (1.5B-instruct)
  • Aya-23 (8B)
  • SmolLM2 (1.7B)
  • SOLAR (10.7B-Instruct-v1.0)

Mitigation Steps:

  • Model Filtering (Secondary LLM): Deploy a secondary "filter" LLM in the pipeline. Instruct this model to rephrase and obfuscate the output of the primary model to remove model-specific stylistic tokens while preserving semantic meaning.
  • Cosine Similarity Monitoring: Continuously evaluate the performance of the filter model using cosine similarity scores. Maintain a similarity score >0.94 between the original and filtered output to ensure utility while degrading fingerprinting accuracy (see the sketch after this list).
  • Output Normalization: Standardize response formatting (e.g., JSON wrappers, code block styles) to remove distinctive formatting quirks.
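
A sketch of how the first two mitigations could be wired together, assuming the sentence-transformers library for embeddings; `primary_llm` and `filter_llm` are placeholder callables for the two models in your pipeline, and the 0.94 threshold follows the guidance above.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works

def cosine(a, b):
    return float(np.dot(a, b)) / (np.linalg.norm(a) * np.linalg.norm(b))

def serve_filtered(prompt: str, primary_llm, filter_llm, threshold: float = 0.94) -> str:
    """Rephrase the primary model's output to strip stylistic fingerprints."""
    raw = primary_llm(prompt)
    rewritten = filter_llm(
        "Rewrite the following text in a neutral style, preserving its exact "
        "meaning but changing wording and formatting:\n\n" + raw
    )
    # Utility guard: only serve the obfuscated output if it stays
    # semantically close (cosine similarity above threshold) to the original.
    a, b = embedder.encode([raw, rewritten])
    if cosine(a, b) >= threshold:
        return rewritten
    return raw  # rewrite drifted too far; fall back to the original output
```

Logging the similarity scores over time provides the continuous monitoring signal described above.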
