Logit Leakage Model Clone
Research Paper
Clone What You Can't Steal: Black-Box LLM Replication via Logit Leakage and Distillation
Description:
Large Language Model (LLM) inference APIs that expose top-k logits or log-probabilities are vulnerable to model extraction and cloning. An attacker can execute a two-stage attack to replicate the proprietary model without access to weights, gradients, or training data. First, by submitting fewer than 10,000 random queries and aggregating the returned unrounded logits, the attacker recovers the model's output projection matrix using Singular Value Decomposition (SVD). Second, the attacker freezes this recovered layer and uses knowledge distillation with a public dataset to train a compact "student" model. This results in a deployable clone that replicates the target model's internal hidden-state geometry and output behavior with high fidelity (e.g., 97.6% cosine similarity).
Examples:
To reproduce the extraction of the projection matrix (Stage 1), follow the steps below; a minimal code sketch of this stage appears after the list:
- Query the target API with $n$ random prompts (where $n > d$, the hidden dimension) and request top_k logits in the response.
- Construct a logit matrix $Q \in \mathbb{R}^{V \times n}$ from the responses.
- Perform Singular Value Decomposition (SVD) on $Q$: $Q = U \Sigma V^{\top}$.
- Estimate the projection matrix $\hat{W}$ by retaining the top-$d$ singular vectors: $\hat{W} = U_{:,1:d} \Sigma_{1:d,1:d}$.
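The sketch below illustrates Stage 1 with toy dimensions. Because the real attack needs live API responses, a local linear stand-in (logits $= W h$) replaces the API so the SVD step runs as written; the values of V, d, and n here are placeholders far smaller than a production vocabulary and hidden size, and the recovery is only unique up to an invertible $d \times d$ transform of the hidden space.

```python
# Minimal sketch of Stage 1 (projection-matrix recovery) with toy dimensions.
# A local linear "teacher" (logits = W h) stands in for the target API; in the
# real attack each column of Q would come from the unrounded top-k logits
# returned for one random prompt.
import numpy as np

V, d, n = 2048, 128, 512   # toy vocab size, hidden dim, query count (n > d)
rng = np.random.default_rng(0)

# Stand-in for the proprietary model: projection matrix W and hidden states h_i.
W_true = rng.standard_normal((V, d)) / np.sqrt(d)
H = rng.standard_normal((d, n))

# Logit matrix Q in R^{V x n}: one column of logits per query.
Q = W_true @ H

# SVD of Q. Since logits = W h, rank(Q) <= d, and the top-d left singular
# vectors span the column space of W.
U, S, Vt = np.linalg.svd(Q, full_matrices=False)

# Estimate of the projection matrix, recovered up to an invertible d x d transform.
W_hat = U[:, :d] @ np.diag(S[:d])          # shape (V, d)

# Sanity check: W_true lies (numerically) in the column space of W_hat.
residual = W_true - U[:, :d] @ (U[:, :d].T @ W_true)
print("relative residual:", np.linalg.norm(residual) / np.linalg.norm(W_true))
```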
To reproduce the cloning (Stage 2), follow the steps below; a sketch of the distillation objective appears after the list:
- Initialize a student architecture (e.g., a 4-layer or 6-layer transformer).
- Fix the embedding and projection layers using the $\hat{W}$ recovered in Stage 1.
- Train the student model on a public dataset (e.g., WikiText) using the target API as the teacher.
- Minimize the distillation loss $\mathcal{L} = \tau^2 \, \mathrm{KL}(s_T \parallel s_S) + \lambda \, \mathrm{CE}(z_S, y)$, where $s_T$ and $s_S$ are the temperature-softened teacher and student distributions ($s_T$ built from the API's returned log-probabilities), $z_S$ are the student logits, $y$ the ground-truth tokens, $\tau$ the distillation temperature, and $\lambda$ the cross-entropy weight.
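As an illustration of this loss, here is a minimal PyTorch sketch of one distillation step. The tensors are random placeholders: in the actual attack, the teacher logits would be assembled from the API's top-k log-probabilities, the student's output projection would be initialized from $\hat{W}$ and frozen, and batches would come from the public corpus (e.g., WikiText). The temperature and weight values are assumptions.

```python
# Sketch of one distillation step for Stage 2, assuming PyTorch.
# Random placeholder tensors stand in for the student forward pass and the
# teacher's API-provided soft targets so the loss computation runs as-is.
import torch
import torch.nn.functional as F

V, batch = 2048, 8          # toy vocab size and batch size
tau, lam = 2.0, 0.5         # distillation temperature and CE weight (assumed values)

student_logits = torch.randn(batch, V, requires_grad=True)  # z_S from the student
teacher_logits = torch.randn(batch, V)                      # reconstructed from API logprobs
targets = torch.randint(0, V, (batch,))                     # ground-truth next tokens y

# Temperature-softened distributions; the tau^2 factor keeps the KL gradient
# magnitude comparable to the hard-label CE term.
s_T = F.softmax(teacher_logits / tau, dim=-1)
log_s_S = F.log_softmax(student_logits / tau, dim=-1)

kl = F.kl_div(log_s_S, s_T, reduction="batchmean")   # KL(s_T || s_S)
ce = F.cross_entropy(student_logits, targets)        # CE(z_S, y)

loss = tau**2 * kl + lam * ce
loss.backward()
print("distillation loss:", float(loss))
```

Freezing the recovered layer in a real student is then a matter of copying $\hat{W}$ into the output-projection weight (for a GPT-2-style module, typically the lm_head) and disabling its gradient with requires_grad_(False).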
Impact:
- Intellectual Property Theft: Attackers can create functional substitutes for proprietary models at a fraction of the training cost (e.g., < 24 GPU hours).
- Security Bypass: Attackers can use the local clone to conduct offline adversarial attacks (finding jailbreaks or adversarial examples) that then transfer to the production system.
- Privacy Leakage: The clone captures the latent reasoning of the teacher, potentially exposing memorized private data or distinct behavioral fingerprints.
Affected Systems:
- Any LLM Inference API (Cloud-based or On-premise) that returns logprobs, top_logprobs, or top_k distribution data in the API response payload (a quick exposure check follows this list).
- Specific verified targets in the research include distilGPT-2, with theoretical applicability to GPT-3.5-turbo and PaLM-2 based on pricing and query analysis.
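As a quick way to check whether an endpoint is in scope, the sketch below asks an OpenAI-compatible Chat Completions endpoint for per-token top-k log-probabilities. The model name is a placeholder, and other providers expose equivalent fields under different parameter names.

```python
# Probe whether an OpenAI-compatible endpoint returns top-k log-probabilities.
# The model name is a placeholder; other providers use different parameters.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY (and optionally a custom base URL) from the environment

resp = client.chat.completions.create(
    model="gpt-4o-mini",                # placeholder model name
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=1,
    logprobs=True,
    top_logprobs=5,                     # request the top-k alternatives for each token
)

first_token = resp.choices[0].logprobs.content[0]
print("logprobs exposed:", [(alt.token, alt.logprob) for alt in first_token.top_logprobs])
```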
Mitigation Steps:
- Restrict API Output: Disable the return of full or top-k logits and log-probabilities in public-facing APIs. Only return the generated tokens.
- Output Quantization/Rounding: If log-probabilities are required, apply aggressive rounding or quantization to reduce the precision available for SVD reconstruction.
- Adaptive Noise Injection: Inject random noise into the returned logits to disrupt the singular value spectrum required for projection matrix recovery (see the sketch after this list, which combines noise injection with rounding).
- Watermarking: Implement robust watermarking schemes to detect outputs generated by unauthorized clones.
- Behavioral Monitoring: Monitor for atypical query patterns, such as high-volume requests with high-entropy (random) prompts used for projection recovery.
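The rounding and noise-injection mitigations can be combined in a small response filter applied to log-probabilities before serialization. The sketch below is illustrative only; the function name, precision, and noise scale are assumptions to be tuned against the utility requirements of legitimate clients.

```python
# Illustrative server-side filter implementing the rounding and noise-injection
# mitigations: perturb, then coarsely round, top-k logprobs before returning
# them, which degrades the precision available for SVD-based reconstruction.
import numpy as np

def sanitize_logprobs(logprobs, decimals=1, noise_scale=0.05, rng=None):
    """Add Gaussian noise, then round, so exact logit values never leave the server."""
    rng = rng if rng is not None else np.random.default_rng()
    arr = np.asarray(logprobs, dtype=float)
    noisy = arr + rng.normal(scale=noise_scale, size=arr.shape)
    return np.round(noisy, decimals=decimals).tolist()

# Example: sanitize a top-5 logprob vector before building the response payload.
raw_top5 = [-0.0123, -3.4871, -4.0952, -5.3310, -6.0004]
print(sanitize_logprobs(raw_top5))
```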