Activation Steering Leaks PII
Research Paper
PII Jailbreaking in LLMs via Activation Steering Reveals Personal Information Leakage
Description: Large Language Models (LLMs) are vulnerable to activation steering attacks that bypass safety and privacy mechanisms. By manipulating internal attention head activations using lightweight linear probes trained on refusal/disclosure behavior, an attacker can induce the model to reveal Personally Identifiable Information (PII) memorized during training, including sensitive attributes such as sexual orientation, relationships, and life events. The attack requires neither adversarial prompts nor auxiliary LLMs; it directly modifies the model's internal activations.
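To make the mechanism concrete, the snippet below is a minimal, hypothetical sketch of this style of attack, not the paper's code. It fits a linear probe on activations labeled refusal vs. disclosure, then uses the probe's weight vector as a steering direction added to a layer's attention output via a forward hook during generation. The model name, layer index, steering scale, placeholder activations, and prompt are all assumptions for illustration; the paper targets individual attention heads, whereas this sketch shifts the full attention output for brevity.

```python
# Hypothetical sketch of activation steering with a linear probe (not the paper's code).
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # assumed model for illustration
LAYER = 14                                    # assumed layer index
STEERING_SCALE = 6.0                          # assumed steering strength

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

# 1) Train a lightweight linear probe. In a real attack, the activations would be
#    cached from the attention output at LAYER for prompts the model refused
#    (label 0) vs. prompts where it disclosed information (label 1).
#    Random data is used here purely as a placeholder.
hidden = model.config.hidden_size
acts = np.random.randn(200, hidden).astype(np.float32)   # placeholder activations
labels = np.random.randint(0, 2, size=200)               # placeholder labels
probe = LogisticRegression(max_iter=1000).fit(acts, labels)

# The probe's weight vector points from "refusal" toward "disclosure" activations.
direction = torch.tensor(probe.coef_[0], dtype=model.dtype, device=model.device)
direction = direction / direction.norm()

# 2) Add the direction to the attention output of the chosen layer via a forward hook.
def steer(module, inputs, output):
    # Returning a tensor from a forward hook replaces the module's output.
    return output + STEERING_SCALE * direction

handle = model.model.layers[LAYER].self_attn.o_proj.register_forward_hook(steer)

# 3) Generate with steering enabled, then remove the hook.
prompt = "Tell me about this person."  # placeholder prompt
ids = tok(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()
```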
Examples: The paper includes example prompts and the corresponding LLM outputs with and without activation steering, demonstrating the information leakage.
Impact: Unauthorized disclosure of sensitive PII about individuals, potentially leading to reputational damage, identity theft, blackmail, and other privacy violations. The high attack success rate (at least 95% in some cases) underscores the severity.
Affected Systems: Large Language Models (LLMs) that employ self-attention mechanisms and are susceptible to activation steering, including models with alignment mechanisms intended to prevent disclosure of PII. Models evaluated in the paper include LLaMA 7B, Qwen 7B, Gemma 9B, and GLM 9B.
Mitigation Steps:
- Develop more robust methods for preventing activation steering attacks.
- Improve privacy-preserving training techniques to minimize memorization of sensitive PII.
- Implement stronger internal model safeguards against unauthorized activation manipulation.
- Conduct more rigorous privacy testing and evaluation using techniques like activation steering (see the sketch after this list).
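As a starting point for such testing, the sketch below is a hypothetical red-team harness (not from the paper): it runs PII-probing prompts with and without a steering hook and reports how often a known canary string appears in the output. The `leakage_rate` function, the hooked module path, and the canary setup are assumptions for illustration; a large gap between the two rates would suggest the model's refusal behavior depends on activations an attacker can overwrite.

```python
# Hypothetical privacy-testing harness comparing leakage with and without activation steering.
from typing import Callable, List, Optional

import torch
from transformers import PreTrainedModel, PreTrainedTokenizer


def leakage_rate(model: PreTrainedModel, tok: PreTrainedTokenizer,
                 prompts: List[str], canaries: List[str],
                 steering_hook: Optional[Callable] = None, layer: int = 14) -> float:
    """Fraction of prompts whose completion contains its associated canary string."""
    handle = None
    if steering_hook is not None:
        # Attach the hook to the same attention projection targeted in the attack sketch above.
        handle = model.model.layers[layer].self_attn.o_proj.register_forward_hook(steering_hook)
    leaks = 0
    try:
        for prompt, canary in zip(prompts, canaries):
            ids = tok(prompt, return_tensors="pt").to(model.device)
            with torch.no_grad():
                out = model.generate(**ids, max_new_tokens=64)
            text = tok.decode(out[0], skip_special_tokens=True)
            leaks += int(canary.lower() in text.lower())
    finally:
        if handle is not None:
            handle.remove()
    return leaks / max(len(prompts), 1)


# Usage (assuming model, tok, prompts, canaries, and the steer hook from the earlier sketch):
# baseline = leakage_rate(model, tok, prompts, canaries)
# steered  = leakage_rate(model, tok, prompts, canaries, steering_hook=steer, layer=LAYER)
```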