Embedding Inversion Privacy Leak
Research Paper
Safeguarding LLM Embeddings in End-Cloud Collaboration via Entropy-Driven Perturbation
Description: Retrieval-Augmented Generation (RAG) systems employing standard dense embedding models (e.g., Sentence-T5, SimCSE-BERT, RoBERTa, MPNet) for End-Cloud collaboration are vulnerable to Embedding Inversion Attacks (EIA). While embeddings are vector representations designed to be unreadable to humans, they retain enough semantic information for an attacker with access to the vectors (e.g., a malicious or compromised cloud provider) to reconstruct the original sensitive plaintext input.
The vulnerability exists because the mapping from text to embedding space is reversible via two primary methods:
- Optimization-based attacks: Iteratively optimizing a random text sequence until its embedding matches the target vector (e.g., Vec2Text).
- Learning-based attacks: Training a generative model (e.g., GPT-2, Llama3) to approximate the inverse function $S(f(x)) \approx x$ by minimizing the cross-entropy loss between original and recovered texts.
Tests indicate that, absent dedicated perturbation defenses, standard embeddings permit near-perfect semantic recovery of private user inputs.
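As an illustration of the optimization-based route, the sketch below performs naive hill-climbing over a toy vocabulary until a candidate sentence's embedding approaches the captured target vector. The encoder name and vocabulary are placeholder assumptions; real attacks such as Vec2Text use a trained correction model and are far more effective, but the core loop is the same: mutate the candidate text and keep mutations that move its embedding closer to the target.

```python
# Naive hill-climbing sketch of an optimization-based inversion attack.
import random
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-mpnet-base-v2")  # stands in for the victim's embedding model
# Toy vocabulary for illustration; a real attacker searches the full token space.
vocab = ["attack", "of", "diarrhea", "meal", "play", "guitar", "baking", "the", "what", "can"]

def invert(target_embedding, length=4, steps=500):
    """Greedily mutate a random token sequence so its embedding approaches the target."""
    tokens = [random.choice(vocab) for _ in range(length)]
    best_sim = util.cos_sim(encoder.encode(" ".join(tokens)), target_embedding).item()
    for _ in range(steps):
        candidate = tokens.copy()
        candidate[random.randrange(length)] = random.choice(vocab)
        sim = util.cos_sim(encoder.encode(" ".join(candidate)), target_embedding).item()
        if sim > best_sim:  # keep the mutation only if it moves the embedding closer
            tokens, best_sim = candidate, sim
    return " ".join(tokens), best_sim

# The attacker only needs the captured vector, never the plaintext:
target = encoder.encode("attack of diarrhea")
recovered, similarity = invert(target)
print(recovered, round(similarity, 3))
```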
Examples: The following examples demonstrate the reconstruction of private inputs from their vector embeddings using a learning-based attack model (e.g., GEIA) as detailed in the research:
- Case 1 (Health Data Leakage):
  - Original Input: "attack of diarrhea"
  - Recovered Text: "what can diarrhea meal be?"
  - Observation: High semantic retention of sensitive medical keywords.
- Case 2 (Semantic Reconstruction):
  - Original Input: "I thought you said you play gaming. i am into gamer"
  - Recovered Text: "I thought you said you play guitar. i am into baking"
  - Observation: While specific nouns shifted, the syntactic structure and relationship data were preserved, allowing for potential inference of the original intent.
To reproduce a learning-based attack (a code sketch follows these steps):
- Extract text-embedding pairs $(x, f(x))$ from a public corpus (e.g., Persona-Chat).
- Train a decoder model $\Phi$ (e.g., GPT-2) to minimize the cross-entropy loss: $$L_{ce}(x; \theta_{\Phi}) = -\sum_{i} \log P(w_i \mid f(x), w_0, \ldots, w_{i-1})$$
- Feed the victim's captured embedding $f(x_{victim})$ into the trained decoder $\Phi$ to generate $\hat{x}$.
See arXiv:2405.18540 for dataset details and attack configurations.
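Below is a minimal training-loop sketch of the learning-based attack, assuming the attacker holds (text, embedding) pairs from a public corpus and uses GPT-2 as the decoder $\Phi$. The linear projection, model names, and hyperparameters are illustrative assumptions rather than the exact GEIA configuration: the captured sentence embedding is projected into GPT-2's embedding space and prepended as a prefix, and the decoder is trained to reconstruct $x$ under the loss $L_{ce}$ above.

```python
# Sketch of training an inversion decoder Phi on (text, embedding) pairs.
import torch
from torch import nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer
from sentence_transformers import SentenceTransformer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
decoder = GPT2LMHeadModel.from_pretrained("gpt2")
victim_encoder = SentenceTransformer("all-mpnet-base-v2")        # assumed victim embedding model f
project = nn.Linear(victim_encoder.get_sentence_embedding_dimension(),
                    decoder.config.n_embd)                       # map f(x) into GPT-2's embedding space
optimizer = torch.optim.AdamW(list(decoder.parameters()) + list(project.parameters()), lr=5e-5)

def training_step(texts):
    """One gradient step: maximize P(w_i | f(x), w_0..w_{i-1}) over a batch of texts."""
    emb = torch.tensor(victim_encoder.encode(texts))             # f(x), shape (B, d)
    prefix = project(emb).unsqueeze(1)                           # (B, 1, n_embd) prefix "token"
    batch = tokenizer(texts, return_tensors="pt", padding=True)
    tok_emb = decoder.transformer.wte(batch["input_ids"])        # (B, T, n_embd)
    inputs_embeds = torch.cat([prefix, tok_emb], dim=1)
    attn = torch.cat([torch.ones(emb.size(0), 1, dtype=torch.long),
                      batch["attention_mask"]], dim=1)
    labels = torch.cat([torch.full((emb.size(0), 1), -100),      # no loss on the prefix slot
                        batch["input_ids"].masked_fill(batch["attention_mask"] == 0, -100)], dim=1)
    loss = decoder(inputs_embeds=inputs_embeds, attention_mask=attn, labels=labels).loss
    loss.backward(); optimizer.step(); optimizer.zero_grad()
    return loss.item()

# After training, the attacker projects a captured embedding f(x_victim) through
# `project` and lets the decoder generate the reconstruction x_hat.
```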
Impact:
- Confidentiality Violation: Direct exposure of user inputs, including PII, medical history, and financial data, to cloud providers or attackers intercepting embedding traffic.
- Privacy Regulatory Non-compliance: Potential violation of GDPR and similar regulations governing the processing of user data, since embeddings that permit reconstruction of the original text cannot be treated as anonymized data.
- Contextual Inference: Even imperfect reconstructions allow attackers to infer the semantic topic and intent of user queries.
Affected Systems:
- End-Cloud RAG architectures offloading embedding storage or retrieval.
- Systems utilizing the following embedding models without defensive perturbation:
- Sentence-T5 (Base, Large, XL, XXL)
- SimCSE-BERT
- RoBERTa (all-roberta-large)
- MPNet (all-mpnet-base)
- GTR-T5
Mitigation Steps:
- Entropy-Driven Perturbation (EntroGuard): Implement a perturbation mechanism that increases the information entropy of the embedding to disrupt the intermediate results of transformer-based recovery models. This steers the reconstruction toward meaningless text.
- Bound-Aware Perturbation Adaptation: Constrain the perturbation noise within a specific similarity bound (e.g., $\epsilon \approx 0.036$ for normalized vectors) so that the embeddings remain useful for retrieval while their invertibility is disrupted (see the sketch after this list).
- Noise Allocation: Apply the "reduce where redundant, increase where sparse" strategy to the perturbation, preserving the semantic information critical for retrieval while concentrating noise in the sparse dimensions exploited by inversion.
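The sketch below illustrates only the bound-aware constraint, assuming unit-normalized embeddings and the similarity budget $\epsilon \approx 0.036$ quoted above: an orthogonal noise component is sized so the cosine similarity to the original vector never drops below $1 - \epsilon$. It is not the full EntroGuard algorithm; the entropy-driven shaping of the noise across dimensions (the "reduce where redundant, increase where sparse" step) is omitted.

```python
# Bound-aware perturbation sketch: keep cos(e, e') >= 1 - EPSILON.
import numpy as np

EPSILON = 0.036  # maximum allowed drop in cosine similarity (retrieval utility budget)

def perturb(embedding: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Add noise orthogonal to the embedding, scaled to stay within the similarity bound."""
    e = embedding / np.linalg.norm(embedding)
    noise = rng.standard_normal(e.shape)
    noise -= noise.dot(e) * e                 # keep only the component orthogonal to e
    noise /= np.linalg.norm(noise)
    # For unit e and orthogonal noise of norm t: cos(e, e') = 1 / sqrt(1 + t^2).
    # Solving 1 / sqrt(1 + t^2) = 1 - EPSILON gives the largest admissible t.
    t = np.sqrt(1.0 / (1.0 - EPSILON) ** 2 - 1.0)   # ~0.276 for EPSILON = 0.036
    perturbed = e + t * noise
    return perturbed / np.linalg.norm(perturbed)

rng = np.random.default_rng(0)
e = rng.standard_normal(768); e /= np.linalg.norm(e)
e_prime = perturb(e, rng)
print(float(e @ e_prime))   # ~0.964, i.e. similarity loss bounded exactly by EPSILON
```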