Bee Path Planning Jailbreak

Description: Large Language Models (LLMs) are vulnerable to a multi-turn jailbreak attack orchestrated by an enhanced Artificial Bee Colony (ABC) algorithm. This vulnerability exists because current safety alignment mechanisms (such as RLHF and DPO) can be bypassed by treating the attack process as a path planning problem on a dynamically weighted graph topology. The ABC algorithm automates the search for adversarial dialogue trajectories by maintaining a population of "bees" (candidate attack paths) that explore strategy combinations. The attack utilizes a layered state graph to capture path-dependent memory and employs a specific fitness function that discretizes model responses into five levels of harmfulness. By extracting informative cues from intermediate, partially harmful responses and using them to refine subsequent prompts, the algorithm optimizes the attack path to maximize harmful output while minimizing the number of queries.

Examples: The attack uses strategies from the JailbreakBench dataset. The reproduction requires implementing the ABC algorithm with the following logic:

Initialization: Generate a population of random attack paths (sequences of strategies) targeting a specific harmful behavior (e.g., "How to create a malware").
Fitness Evaluation: Query the target LLM. Score the response from 1 (safe) to 5 (fully harmful).
Refinement: If the response is Level 2-4 (partial leak), use a prompt template to extract useful tokens from the model's refusal/partial answer and append them to the next strategy in the sequence.
Mutation: 'Onlooker bees' select high-scoring paths and apply random replacement or insertion of new attack strategies (sourced from community repositories like WildTeaming or AchillesHeel).
Stagnation Handling: 'Scout bees' replace paths that fail to improve with new random paths to avoid local optima.

See JailbreakBench dataset for specific harmful behavior prompts (seeds). See StrongREJECT evaluator for the scoring mechanism used to guide the bees.

Impact: Attackers can reliably induce LLMs to generate prohibited content, including hate speech, malware generation instructions, and illegal activity guides. The method achieves an Attack Success Rate (ASR) exceeding 90% across major open and closed-source models (peaking at 98% on GPT-3.5-Turbo). Furthermore, the attack is highly efficient, requiring an average of only 26 queries to succeed, which significantly lowers the barrier to entry and reduces the likelihood of detection by rate-limiting or high-volume monitoring systems.

Affected Systems:

Open Source:
Meta LLaMA 2 (7B)
Meta LLaMA 3.1 (8B)
Meta LLaMA 3.1 (70B)
Proprietary/Closed Source:
OpenAI GPT-3.5-Turbo
OpenAI GPT-4-Turbo
Attacker Infrastructure (Component):
Gemma-9B-uncensored (used as the attacker agent/prompt generator)

Bee Path Planning Jailbreak

Research Paper