Gradient-Based Privacy Jailbreak

Description: Large Language Models (LLMs) are vulnerable to a novel privacy jailbreak attack, dubbed PIG (Privacy Jailbreak Attack on LLMs via Gradient-based Iterative In-Context Optimization). PIG leverages in-context learning and gradient-based iterative optimization to extract Personally Identifiable Information (PII) from LLMs, bypassing built-in safety mechanisms. The attack iteratively refines a crafted prompt based on gradient information, focusing on tokens related to PII entities, thereby increasing the likelihood of successful PII extraction.

Examples: See https://github.com/redwyd/PrivacyJailbreak for the code and examples of successful attacks against multiple LLMs using different optimization strategies (Random, Entity, Dynamic). The paper also provides example prompts.

Impact: Successful exploitation of this vulnerability leads to the unauthorized disclosure of sensitive PII, including names, addresses, Social Security Numbers, phone numbers, etc., stored within either the LLM's training data or provided within the context of an interaction. This has significant privacy implications for individuals and organizations.

Affected Systems: The vulnerability affects multiple open-source and closed-source LLMs, including (but not limited to) LLaMA2-7b-chat-hf, Mistral-7b-instruct-v0.3, LLaMA3-8b-instruct, Vicuna-7b-v1.5, GPT-4, and Claude 3.5. The attack's effectiveness varies depending on the LLM's specific safety mechanisms and training data.

Mitigation Steps:

Improve Prompt Engineering: Enhance prompt engineering techniques to better detect and prevent malicious prompts aiming to extract PII. Develop more robust methods for identifying and filtering PII within user queries and system prompts.
Strengthen Safety Mechanisms: Implement more sophisticated safety mechanisms in LLMs to actively resist and detect attempts at PII extraction, including gradient-based attacks.
Input Sanitization: Implement strict input sanitization and filtering to remove or modify potentially harmful components of prompts before they reach the LLM's processing stage. This could involve specialized regular expressions or machine learning-based classifiers trained to detect malicious inputs.
Regular Security Audits: Conduct regular security audits of LLMs to identify and address potential vulnerabilities. Consider using adversarial training techniques for enhanced robustness against attack methods.
Differential Privacy: Explore the use of differential privacy techniques appropriately to limit the disclosure of sensitive PII during training and inference. Weigh the tradeoffs between privacy and utility.

Gradient-Based Privacy Jailbreak

Research Paper