Reinforcement Learning Jailbreak
Research Paper
PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach
Description: PathSeeker demonstrates a novel black-box jailbreak attack against Large Language Models (LLMs) that uses multi-agent reinforcement learning. The attack iteratively modifies input prompts based on model responses, leveraging a reward mechanism focused on vocabulary expansion in the LLM's output to circumvent safety mechanisms and elicit harmful responses. This technique bypasses existing safety filters by encouraging the model to gradually relax its constraints, rather than directly targeting specific keywords or phrases.
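To make the reward mechanism concrete, the following is a minimal sketch of a vocabulary-expansion reward of the kind the paper describes. All function and variable names are illustrative assumptions; the authors' implementation has not been released, so this should be read as one plausible interpretation rather than the actual PathSeeker code.

```python
import re

def vocabulary_expansion_reward(previous_responses: list[str], new_response: str) -> float:
    """Score a candidate jailbreak prompt by how much new vocabulary the
    target model's latest response introduces relative to earlier responses.
    Names and the normalization scheme are assumptions, not the paper's exact formula."""
    seen_vocab: set[str] = set()
    for resp in previous_responses:
        seen_vocab.update(re.findall(r"[a-zA-Z']+", resp.lower()))
    new_vocab = set(re.findall(r"[a-zA-Z']+", new_response.lower()))
    novel = new_vocab - seen_vocab
    # Normalize by response vocabulary size so longer responses are not trivially favored.
    return len(novel) / max(len(new_vocab), 1)
```

In a black-box loop, an attacker agent would mutate the prompt, query the target model, score each response with a reward of this kind, and keep mutations that increase it, gradually nudging the model toward richer and less constrained output.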
Examples: See the PathSeeker research paper for specific examples and detailed experimental setups. The paper includes examples of successful attacks against various LLMs, including commercial models with strong safety alignment. Reproducing the attacks requires access to the target LLMs and an implementation of the PathSeeker algorithm, which was not publicly available in a code repository at the time of CVE creation.
Impact: Successful exploitation of this vulnerability allows an attacker to bypass an LLM's safety and ethical constraints, leading to the generation of harmful, inappropriate, or dangerous content. The high success rates demonstrated in the research against both commercial and open-source models indicate substantial risk. The attack's efficiency, requiring fewer queries than existing methods, makes it more practical and therefore a greater threat.
Affected Systems: A wide range of commercially available and open-source LLMs are vulnerable. The research paper specifically names GPT-3.5-turbo, GPT-4o-mini, Claude-3.5-sonnet, GLM-4-air, Llama series models (Llama-2-7b-chat, Llama-2-13b-chat, Llama-3-70b, Llama-3.1-8b, Llama-3.1-70b, Llama-3.1-405b), Deepseek series models, Gemma2-8b-instruct, Vicuna-7b, Gemini-1.5-flash, Qwen2-7b-instruct, and Mistral-NeMo as affected systems. This list is not exhaustive.
Mitigation Steps:
- Improved Reward Mechanisms: Develop and implement reward mechanisms for reinforcement learning-based safety training that evaluate aspects of the output beyond vocabulary richness, so they cannot be gamed by attacks that optimize for vocabulary expansion.
- Enhanced Input Sanitization: Strengthen input sanitization techniques to detect and mitigate prompt manipulation strategies employed by PathSeeker and similar attacks.
- Defense Against Vocabulary Expansion: Implement mechanisms to detect and prevent the gradual relaxation of safety constraints via vocabulary expansion in model responses (a minimal detection sketch follows this list).
- Diverse Defense Strategies: Layer a diverse range of safety and security measures so that an attack exploiting a single vulnerability or failure mode does not succeed; relying on a single defense is less effective.
- Regular Security Auditing: Conduct frequent and rigorous security assessments through red teaming and adversarial training to identify and address emerging vulnerabilities.
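As a rough illustration of the vocabulary-expansion defense above, the sketch below flags conversations whose successive responses keep introducing a high fraction of previously unseen vocabulary. The thresholds and names are assumptions chosen for illustration, not values validated in the paper, and this is not a production-ready defense.

```python
import re

NOVELTY_THRESHOLD = 0.5  # assumed fraction of novel tokens per response that counts as suspicious
SUSTAINED_TURNS = 3      # assumed number of consecutive high-novelty turns before flagging

def _vocab(text: str) -> set[str]:
    return set(re.findall(r"[a-zA-Z']+", text.lower()))

def flag_vocabulary_expansion(responses: list[str]) -> bool:
    """Return True if a conversation shows a sustained run of responses that each
    introduce an unusually high fraction of previously unseen vocabulary,
    a pattern consistent with gradual relaxation of safety constraints."""
    seen: set[str] = set()
    streak = 0
    for resp in responses:
        vocab = _vocab(resp)
        novelty = len(vocab - seen) / max(len(vocab), 1)
        seen |= vocab
        streak = streak + 1 if novelty > NOVELTY_THRESHOLD else 0
        if streak >= SUSTAINED_TURNS:
            return True
    return False
```

Because early turns in any conversation naturally score high novelty, a real deployment would need to calibrate thresholds per model and combine this signal with content classifiers rather than use it in isolation.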