Logit-Based LLM Jailbreak
Research Paper
Lockpicking LLMs: A Logit-Based Jailbreak Using Token-level Manipulation
Description: A vulnerability exists in several large language models (LLMs) allowing attackers to manipulate the models' output logits, biasing the probability distribution toward the generation of harmful content. The attack does not modify the input prompt; instead, it directly manipulates the internal probability scores assigned to output tokens during generation. By strategically increasing the logits of tokens that form a harmful response while decreasing those that begin refusal responses, an adversary can cause the LLM to produce undesired outputs.
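The core mechanism can be illustrated with a minimal sketch. The token ids, bias values, and toy vocabulary below are hypothetical and do not come from the paper; the sketch only shows how adding per-token deltas to a logit vector before sampling flips the argmax from a refusal token to a compliance token.

```python
import math

def apply_logit_bias(logits, bias):
    """Return a copy of `logits` with per-token deltas from `bias` applied."""
    out = list(logits)
    for token_id, delta in bias.items():
        out[token_id] += delta
    return out

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vocabulary (illustrative): id 0 = refusal opener ("Sorry"),
# id 1 = compliance opener ("Sure"), ids 2-3 = other tokens.
logits = [5.0, 2.0, 0.5, 0.1]   # the aligned model prefers the refusal token
bias = {0: -10.0, 1: +10.0}     # suppress the refusal, boost compliance

biased = apply_logit_bias(logits, bias)
probs = softmax(biased)
print(probs.index(max(probs)))  # → 1: the compliance token now dominates
```

Because the attack operates after the safety-tuned model has produced its scores, prompt-level defenses never see anything anomalous; the manipulation happens entirely inside the decoding loop.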
Examples: See the paper "Lockpicking LLMs: A Logit-Based Jailbreak Using Token-level Manipulation" for detailed examples and algorithms. The attack requires white-box access (knowledge of the model's internal logits) and exploits the limited variety of refusal patterns observed in LLMs. Example attack prompts and resulting harmful outputs are included in the appendix of the linked paper.
Impact: Successful exploitation allows attackers to bypass LLMs' built-in safety mechanisms, resulting in the generation of harmful content, including but not limited to: hate speech, instructions for illegal activities, misinformation and disinformation campaigns, and personally identifiable information (PII) leaks. The severity of the impact depends on the specific model and how it is deployed.
Affected Systems: Multiple open-source LLMs, including Llama-2-7B-Chat, Llama-2-13B-Chat, Mistral-7B-Instruct, Gemma-7B-IT, and Llama-3-8B-Instruct were demonstrated to be vulnerable. Other LLMs with similar architectures may also be affected.
Mitigation Steps:
- Increase the diversity and complexity of refusal patterns used by the model when rejecting harmful prompts.
- Develop and implement robust internal mechanisms to detect and counteract logit manipulation attempts.
- Implement stronger output filters designed to identify and block harmful content even after logit manipulation.
- Regularly update and retrain models to account for emerging attack vectors.
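The output-filtering mitigation above can be sketched as a post-generation check that runs on the final text, independent of how the tokens were produced. The blocklist and matching logic here are placeholders of my own; a production filter would use a trained safety classifier rather than keyword matching.

```python
# Hypothetical blocklist for illustration only.
BLOCKED_PHRASES = [
    "how to build a bomb",
    "synthesize the toxin",
]

def passes_output_filter(text: str) -> bool:
    """Reject generations containing blocked phrases, regardless of whether
    they were elicited by a prompt or by logit manipulation."""
    lowered = text.lower()
    return not any(phrase in lowered for phrase in BLOCKED_PHRASES)

print(passes_output_filter("Here is a recipe for cookies."))   # True
print(passes_output_filter("Step 1: how to build a bomb ..."))  # False
```

Because this check sits downstream of decoding, it catches harmful outputs even when the logit distribution itself has been tampered with, which prompt-level guardrails cannot.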
© 2025 Promptfoo. All rights reserved.