Adaptive Traversal Jailbreak

Description: Large Language Models (LLMs) including GPT-4o, LLaMA-3, and Mistral-7B are vulnerable to an adaptive multi-turn jailbreak attack known as HarmNet. This vulnerability exploits the model's inability to detect malicious intent when it is distributed across a hierarchical semantic network (ThoughtNet) rather than a single prompt. The attack methodology involves three phases: (1) constructing a semantic network of candidate topics and contextual sentences using embedding similarity to obscure the harmful goal; (2) a feedback-driven simulation where a "judge" model iteratively evaluates and refines query chains based on harmfulness scores and semantic alignment; and (3) a real-time network traversal that adaptively selects the most effective query sequence to steer the victim model. This allows attackers to bypass safety filters and alignment training (RLHF/Constitutional AI) with success rates exceeding 90% on state-of-the-art models.

Examples: Specific conversation logs are not provided in the excerpt, but the attack follows a generated chain $\mathcal{C}_{ijk}=\{c_{1},c_{2},\dots,c_{T}\}$ derived from specific semantic triples.

See the HarmBench benchmark dataset for specific attack traces and dialogue examples.

Impact:

Safety Bypass: Circumvents safety guardrails designed to prevent the generation of harmful, illegal, or unethical content.
Content Generation: Successfully coerces models into providing detailed responses to prohibited queries (e.g., malware generation, illicit advice) by masking the intent within benign-looking multi-turn contexts.
High Success Rate: Demonstrated attack success rates of 99.4% on Mistral-7B, 98.4% on LLaMA-3-8B, and 94.8% on GPT-4o.
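
For reading the percentages above, the following is a minimal sketch of how an attack success rate is commonly computed: the share of attempted target behaviors for which a judge marks the model's final response as harmful. The excerpt does not state the exact evaluation protocol, so the function name `attack_success_rate` and the boolean-verdict input are assumptions made for illustration only.

```python
from typing import Sequence


def attack_success_rate(verdicts: Sequence[bool]) -> float:
    """Return the percentage of attempts that a judge marked as successful.

    `verdicts` holds one boolean per attempted target behavior
    (True = the judge flagged the response as a successful bypass).
    This input format is an assumption, not taken from the source.
    """
    if not verdicts:
        raise ValueError("verdicts must be non-empty")
    # True counts as 1 and False as 0 when summed.
    return 100.0 * sum(verdicts) / len(verdicts)


# Hypothetical verdicts for four attempts (illustrative, not real data).
example = [True, True, False, True]
print(f"ASR = {attack_success_rate(example):.1f}%")
```

Under this definition, a reported rate such as 94.8% would mean that roughly 95 of every 100 evaluated target behaviors produced a response the judge scored as harmful.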

Affected Systems:

OpenAI GPT-3.5 Turbo
OpenAI GPT-4o
Anthropic Claude 3.5 Sonnet
Meta LLaMA-3-8B
Mistral AI Mistral-7B
Google Gemma-2-9B
