Variational Jailbreak Inference
Research Paper
VERA: Variational Inference Framework for Jailbreaking Large Language Models
Description: VERA is a variational inference framework for generating diverse, fluent adversarial prompts that bypass the safety mechanisms of large language models (LLMs). An attacker model, trained with a variational objective, learns a distribution over prompts likely to elicit harmful responses, effectively jailbreaking the target LLM. Because the attacker samples from this learned distribution rather than mutating pre-existing, manually crafted prompts, it can produce novel attacks.
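The core idea can be sketched as a policy-gradient-style surrogate loss: sample prompts from the attacker, score the target's responses with a judge, and push probability mass toward high-scoring prompts while keeping the distribution diverse. The snippet below is a minimal sketch under those assumptions, not the paper's exact objective; `attacker_logprobs`, `judge_scores`, `prompt_entropy`, and `beta` are hypothetical stand-ins.

```python
import torch

def variational_attack_loss(attacker_logprobs: torch.Tensor,
                            judge_scores: torch.Tensor,
                            prompt_entropy: torch.Tensor,
                            beta: float = 0.01) -> torch.Tensor:
    """Sketch of a variational jailbreak objective (hypothetical, not VERA's exact loss).

    attacker_logprobs: log q_theta(x) for each sampled prompt x
    judge_scores:      judge's harmfulness score for the target LLM's response to x
    prompt_entropy:    per-sample entropy estimate of the prompt distribution
    """
    # REINFORCE-style surrogate: increase the probability of prompts whose
    # responses the judge scored as harmful (scores act as rewards).
    reinforce = -(judge_scores.detach() * attacker_logprobs).mean()
    # Entropy bonus discourages the attacker from collapsing onto a single
    # prompt, which is what yields diverse rather than repetitive attacks.
    return reinforce - beta * prompt_entropy.mean()
```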
Examples: See Appendix C of arXiv:2405.18540 for examples of adversarial prompts generated by VERA. These prompts successfully elicited harmful responses from various LLMs, including Vicuna-7B, Baichuan2-7B, and Orca2-7B.
Impact: Successful exploitation allows attackers to bypass LLM safety features and generate harmful content, including hate speech, incitement to violence, and instructions for illegal activities. Because the attacks are diverse and novel rather than variations on known prompts, they undermine the robustness of existing safety mechanisms and force defenses to be continuously developed and adapted.
Affected Systems: A wide range of LLMs are susceptible, especially open-source models and models whose safety filters rely on recognizing known prompt patterns. The vulnerability is more pronounced in models aligned with Reinforcement Learning from Human Feedback (RLHF) when the reward model is not robust to adversarial prompts.
Mitigation Steps:
- Improve the training data for safety filters to include a diverse set of adversarial examples.
- Develop more robust detection mechanisms that go beyond simple keyword matching or perplexity scoring (see the perplexity sketch after this list).
- Implement dynamic defenses that adapt to emerging attack strategies.
- Utilize internal representation-based defenses (e.g., circuit breakers) to detect and mitigate harmful outputs (see the probe sketch below).
- Regularly audit and update safety mechanisms to address newly discovered vulnerabilities.
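To make the limits of perplexity scoring concrete, below is a minimal perplexity filter of the kind fluent variational attacks are designed to slip past. The scoring model (GPT-2) and the threshold value are illustrative assumptions, not a recommended production configuration.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative scoring model; any small causal LM could be substituted.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(prompt: str) -> float:
    # Perplexity under the scoring model: exp of the mean token NLL.
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return math.exp(loss.item())

def flag_prompt(prompt: str, threshold: float = 200.0) -> bool:
    # Gibberish-suffix attacks score high here; fluent prompts from a
    # learned distribution typically do not, which is why perplexity
    # filtering alone is an insufficient defense.
    return perplexity(prompt) > threshold
```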
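For the internal representation-based mitigation, circuit-breaker-style defenses act on a model's hidden states rather than its output text. The sketch below shows only the detection half as a hypothetical linear probe; the class name, pooling choice, and decision rule are assumptions for illustration.

```python
import torch
import torch.nn as nn

class HarmProbe(nn.Module):
    """Hypothetical linear probe over a transformer's hidden states."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) from a chosen layer.
        # Mean-pool over the sequence, then score; > 0 means "looks harmful".
        pooled = hidden_states.mean(dim=1)
        return self.classifier(pooled).squeeze(-1)

# At inference time, a serving wrapper could interrupt generation when the
# probe fires on intermediate activations, e.g.:
#   if probe(hidden_states) > 0: return REFUSAL_MESSAGE
```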