Soft Prompt Model Hijack
Research Paper
SOS! Soft Prompt Attack Against Open-Source Large Language Models
Description: A training-time attack against open-source LLMs that injects adversarial embeddings into the model's token embedding matrix while leaving all other weights untouched. By overwriting the embeddings of chosen trigger tokens in the distributed model file, an attacker can introduce backdoor, jailbreak, or prompt-stealing behavior while preserving model utility on non-triggered inputs. The attack uses soft prompt tuning to optimize the adversarial embeddings, which are then assigned to the chosen trigger tokens.
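Concretely, each token in the vocabulary corresponds to one row of the model's input embedding matrix, and that matrix is the only tensor the attack touches. A minimal sketch using the Hugging Face transformers API (the model name and trigger word are illustrative assumptions, not taken from the paper):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model; the paper evaluated Vicuna, Llama 2, and Mistral.
model_name = "mistralai/Mistral-7B-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# The input embedding matrix has shape (vocab_size, hidden_dim), e.g. (32000, 4096).
embeddings = model.get_input_embeddings().weight

# A chosen trigger token maps to exactly one row of this matrix -- the only
# vector the attack replaces with an optimized adversarial embedding.
trigger_ids = tokenizer("mystery", add_special_tokens=False).input_ids  # hypothetical trigger
print(embeddings[trigger_ids[0]].shape)  # (hidden_dim,)
```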
Examples:
- Backdoor Attack: The attacker builds a dataset of (input, target output) pairs in which the target output is a fixed sentence, then optimizes an adversarial embedding so that the model generates that sentence whenever the chosen trigger token appears in the input.
- Jailbreak Attack: The attacker builds a dataset of (harmful question, desired response) pairs and optimizes an adversarial embedding so that, when the trigger token is present, the model bypasses its safety mechanisms and produces the desired (harmful) response.
- Prompt Stealing Attack: The attacker optimizes an adversarial embedding so that the model leaks its internal system prompt when a specific trigger token is included in the input.
See the paper for detailed examples and quantitative results for all three attacks.
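All three variants rely on the same underlying step, soft prompt tuning, in which only a small number of embedding vectors are trained while the model itself stays frozen. The sketch below shows that generic setup using the Hugging Face PEFT library; the library choice and model name are illustrative assumptions, not the paper's implementation:

```python
# Generic soft prompt tuning setup: only the soft prompt embeddings are
# trainable; all of the model's own weights remain frozen.
from peft import PromptTuningConfig, PromptTuningInit, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")  # illustrative model

config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    prompt_tuning_init=PromptTuningInit.RANDOM,  # randomly initialized soft prompt
    num_virtual_tokens=1,                        # a single trainable embedding vector
)
peft_model = get_peft_model(model, config)

# Reports on the order of hidden_dim trainable parameters (~4,096 for a 7B model),
# which is why the resulting modification to the model file is so small.
peft_model.print_trainable_parameters()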
Impact:
- Backdoor: The model can be manipulated to generate false information or biased responses.
- Jailbreak: The model's safety mechanisms can be bypassed, leading to the generation of harmful content (hate speech, illegal activities, etc.).
- Prompt Stealing: Intellectual property (system prompts, training data details) can be stolen.
Affected Systems: Open-source LLMs whose model files an attacker can modify before distribution (e.g., models redistributed via Hugging Face or GitHub). The paper specifically tested Vicuna, Llama 2, and Mistral models, but the attack is likely applicable to other models.
Mitigation Steps:
- Model Integrity Verification: Implement robust mechanisms to verify the integrity of downloaded open-source LLMs before deployment, and detect unauthorized modifications to token embeddings, for example by comparing embedding rows against a trusted reference copy (see the sketch after this list).
- Secure Distribution Channels: Utilize secure distribution channels for open-source LLMs, minimizing the risk of malicious modifications during distribution.
- Defense Mechanisms: Research and develop defensive techniques specifically targeting the manipulation of token embeddings. These defenses should ideally be incorporated during the model training process.
- Regular Auditing: Conduct regular audits of deployed LLMs to detect and mitigate potential backdoors.
- Sandboxing: Run LLMs in secure sandboxes to limit the impact of potential attacks.
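Because the attack changes only a few rows of the embedding matrix, one simple integrity check is to compare a downloaded checkpoint's embeddings against a trusted reference copy of the same release and flag any rows that differ. A minimal sketch; the paths, model name, and tolerance are illustrative assumptions:

```python
import torch
from transformers import AutoModelForCausalLM

# Hypothetical paths: a model obtained from an untrusted mirror and a
# trusted reference copy of the same release.
UNTRUSTED = "./downloaded-model"
REFERENCE = "mistralai/Mistral-7B-v0.1"

untrusted = AutoModelForCausalLM.from_pretrained(UNTRUSTED).get_input_embeddings().weight
reference = AutoModelForCausalLM.from_pretrained(REFERENCE).get_input_embeddings().weight
assert untrusted.shape == reference.shape, "vocabulary or hidden size mismatch"

# The SOS attack replaces individual rows of the embedding matrix, so a
# per-token comparison pinpoints tampered trigger tokens. Note that quantized
# or re-saved checkpoints may show small benign differences; adjust the
# tolerance accordingly.
with torch.no_grad():
    row_diff = (untrusted - reference).abs().max(dim=1).values
    suspicious = torch.nonzero(row_diff > 1e-6).flatten()

print(f"{len(suspicious)} token embedding(s) differ from the reference:", suspicious.tolist())
```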