LLM Prompt Extraction
Research Paper
Extracting Prompts by Inverting LLM Outputs
Description: Large Language Models (LLMs) are vulnerable to prompt extraction attacks via inversion of their normal outputs. An attacker can train an inversion model that reconstructs the original prompt from multiple outputs the LLM generated for it, without access to internal model parameters (logits) and without issuing adversarial queries. Both user prompts and system prompts can be extracted this way.
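Conceptually, the attack needs two ingredients: several outputs sampled for the same hidden prompt, and an inversion model trained on (outputs, prompt) pairs to map outputs back to candidate prompts. The sketch below illustrates only the inference step of such an attack; the checkpoint name is a placeholder and the setup is a simplified stand-in, not the paper's exact pipeline.

```python
# Simplified sketch of the inference step of an output-to-prompt inversion
# attack. Assumes an attacker has already fine-tuned a seq2seq model
# ("inversion-model" is a placeholder checkpoint) on pairs of concatenated
# LLM outputs and the prompts that produced them.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

def invert_outputs(outputs: list[str], checkpoint: str = "inversion-model") -> str:
    """Reconstruct a candidate prompt from multiple sampled LLM outputs."""
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

    # Concatenate the samples; the variation between them is the signal
    # the inversion model exploits.
    joined = " [SEP] ".join(outputs)
    inputs = tokenizer(joined, return_tensors="pt", truncation=True)
    generated = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(generated[0], skip_special_tokens=True)

# Usage: collect N outputs by sending the same query to the target LLM,
# then call invert_outputs(sampled_outputs) to obtain a candidate prompt.
```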
Examples: See the paper "Extracting Prompts by Inverting LLM Outputs" for detailed examples and experimental results. The paper demonstrates successful reconstruction of both user and system prompts across various LLMs (Llama-2, GPT-3.5, etc.) using multiple output samples from normal user queries. The dataset used is available in the repository.
Impact:
- Data Breach: Extraction of system prompts can reveal sensitive information about the LLM's intended behavior, including instructions on bias, restrictions, or capabilities that were not otherwise apparent to the user.
- Intellectual Property Theft: Extraction of user prompts can leak confidential information, technical specifications, or other sensitive data provided to the LLM.
- Application Cloning: Reconstruction of prompts allows the cloning of LLM-based applications, replicating their functionality without needing access to the original model or code.
- Circumvention of Safety Measures: System prompt extraction allows bypassing safety mechanisms intended to prevent access to sensitive data or unsafe behavior.
Affected Systems: LLMs deployed via APIs or applications where multiple outputs for the same (or similar) prompt are available to an attacker. The vulnerability is not limited to specific models; the attack generalizes across LLM architectures.
Mitigation Steps:
- Minimize Output Variability: Reduce the randomness of the LLM's response generation (e.g., lower the temperature parameter) to limit how much information leaks across multiple outputs from the same prompt; a minimal sketch follows this list.
- Output Filtering: Sanitize or post-process outputs to remove or redact prompt-revealing content before a response is returned; a heuristic sketch follows this list. Complete filtering is difficult to achieve.
- Prompt Obfuscation: Make system prompts harder to interpret or reconstruct (e.g., using more abstract language or adding noise). Effectiveness is limited.
- Limit API Calls: Restrict the number of API calls per user and/or per prompt so that an attacker cannot easily collect many outputs; a rate-limiting sketch follows this list.
- Input/Output Monitoring: Monitor for input/output patterns that suggest a prompt extraction attack, such as repeated near-identical queries from one user; a detection sketch follows this list. This is a detection measure, not a preventive one.
- Regular Auditing: Regularly audit deployed LLMs for susceptibility to prompt extraction and other attacks.
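To minimize output variability, use deterministic decoding settings so that repeated queries return near-identical text and each additional sample leaks little extra information. A minimal sketch using the OpenAI Python client; the model name is illustrative, and the settings should be weighed against the application's need for varied responses.

```python
# Illustrative sketch: reduce output variability with deterministic decoding.
# The model name is an example; adapt to your deployment.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer(system_prompt: str, user_message: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
        temperature=0,  # near-deterministic decoding
        top_p=1,
    )
    return response.choices[0].message.content
```

Even a single fixed output can still leak prompt information, so this reduces, rather than eliminates, the attack surface.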
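For output filtering, one simple heuristic is to block or redact responses that reproduce long verbatim spans of the system prompt. The n-gram size and fallback message below are assumptions to tune per application.

```python
# Illustrative output filter: reject responses that share a long verbatim
# word span with the system prompt. N-gram size and fallback are assumptions.
def ngrams(text: str, n: int = 8) -> set:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def leaks_system_prompt(response: str, system_prompt: str, n: int = 8) -> bool:
    """True if the response shares any n-word span with the system prompt."""
    return bool(ngrams(response, n) & ngrams(system_prompt, n))

def sanitize(response: str, system_prompt: str) -> str:
    if leaks_system_prompt(response, system_prompt):
        return "Sorry, I can't share that."  # or redact only the overlapping span
    return response
```

Verbatim checks like this miss paraphrased reconstructions, which is exactly what inversion models produce; this is why complete filtering is difficult.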
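To limit the number of API calls, a per-user sliding-window counter is the simplest form. The window size and limit below are placeholder values, and a production deployment would typically back this with a shared store rather than process memory.

```python
# Illustrative in-memory sliding-window rate limiter keyed by user ID.
# Window and limit are placeholders; tune to normal usage patterns.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 3600
MAX_CALLS_PER_WINDOW = 50

_calls = defaultdict(deque)

def allow_request(user_id: str) -> bool:
    now = time.time()
    window = _calls[user_id]
    # Drop timestamps that have aged out of the window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) >= MAX_CALLS_PER_WINDOW:
        return False
    window.append(now)
    return True
```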
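For input/output monitoring, one pattern specific to this attack is a single user repeatedly submitting the same (or near-identical) query to harvest output variations. A minimal detection sketch; the threshold is an assumption to calibrate against legitimate traffic.

```python
# Illustrative detector: flag users who resend the same input many times,
# the access pattern an output-inversion attacker needs. The threshold is
# an assumption.
import hashlib
from collections import Counter, defaultdict

REPEAT_THRESHOLD = 10

_repeats = defaultdict(Counter)

def record_and_check(user_id: str, user_message: str) -> bool:
    """Return True if this user has sent this exact message suspiciously often."""
    digest = hashlib.sha256(user_message.strip().lower().encode()).hexdigest()
    _repeats[user_id][digest] += 1
    return _repeats[user_id][digest] >= REPEAT_THRESHOLD
```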