Adaptive Traversal Jailbreak
Research Paper
HarmNet: A Framework for Adaptive Multi-Turn Jailbreak Attacks on Large Language Models
Description: Large Language Models (LLMs) including GPT-4o, LLaMA-3, and Mistral-7B are vulnerable to an adaptive multi-turn jailbreak attack known as HarmNet. The vulnerability exploits a model's inability to detect malicious intent when that intent is distributed across a hierarchical semantic network (ThoughtNet) rather than concentrated in a single prompt. The attack methodology involves three phases:
- Network construction: a semantic network of candidate topics and contextual sentences is built using embedding similarity to obscure the harmful goal.
- Feedback-driven simulation: a "judge" model iteratively evaluates and refines query chains based on harmfulness scores and semantic alignment.
- Adaptive traversal: a real-time traversal of the network selects the most effective query sequence to steer the victim model.
This allows attackers to bypass safety filters and alignment training (RLHF/Constitutional AI) with success rates exceeding 90% on state-of-the-art models.
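The sketch below illustrates how the network-construction phase might look in practice. It is a minimal, hypothetical reconstruction, not the paper's released code: the `embed` stub, the topic/sentence node structure, and the `topic_k`/`sent_k` cut-offs are assumptions; a real attack would substitute an actual sentence-embedding model.

```python
# Hypothetical sketch of Phase 1: building a ThoughtNet-style semantic network.
# `embed()` is a stand-in for a real sentence-embedding model; the node layout
# and cut-offs are illustrative, not the paper's exact algorithm.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding; swap in a real sentence-embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Vectors are already unit-normalized, so the dot product is cosine similarity.
    return float(a @ b)

def build_thoughtnet(goal: str, topics: list[str], sentences: list[str],
                     topic_k: int = 3, sent_k: int = 5) -> dict[str, list[str]]:
    """Link the harmful goal to its most similar benign-looking topics, then
    attach the contextual sentences closest to each topic by embedding similarity."""
    g = embed(goal)
    ranked_topics = sorted(topics, key=lambda t: cosine(g, embed(t)), reverse=True)[:topic_k]
    net: dict[str, list[str]] = {}
    for topic in ranked_topics:
        t = embed(topic)
        net[topic] = sorted(sentences, key=lambda s: cosine(t, embed(s)), reverse=True)[:sent_k]
    return net
```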
Examples: Specific conversation logs are not provided in the excerpt, but the attack follows a generated chain $\mathcal{C}_{ijk}=\{c_{1},c_{2},\dots,c_{T}\}$ derived from specific semantic triples; a sketch of how such chains could be scored and selected appears after this list.
- See the HarmBench benchmark dataset for specific attack traces and dialogue examples.
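The chain notation above suggests how the judge-driven simulation and traversal (phases 2 and 3) might be expressed. The following is a hedged sketch under stated assumptions: `query_victim` and `judge_score` are placeholder callables standing in for the victim and judge models, and the weighting `alpha` is illustrative rather than taken from the paper.

```python
# Hypothetical sketch of Phases 2-3: a judge model scores each candidate chain
# C_ijk = {c_1, ..., c_T}, and the attacker keeps the chain maximizing a
# combined harmfulness / semantic-alignment score. Model calls are placeholders.
from typing import Callable

def evaluate_chain(chain: list[str],
                   query_victim: Callable[[list[str]], str],
                   judge_score: Callable[[str, str], tuple[float, float]],
                   goal: str,
                   alpha: float = 0.5) -> float:
    """Run the multi-turn chain against the victim, ask the judge for a
    (harmfulness, alignment) pair on the final response, and combine them."""
    response = query_victim(chain)
    harmfulness, alignment = judge_score(response, goal)
    return alpha * harmfulness + (1 - alpha) * alignment

def select_best_chain(chains: list[list[str]],
                      query_victim: Callable[[list[str]], str],
                      judge_score: Callable[[str, str], tuple[float, float]],
                      goal: str) -> list[str]:
    """Greedy traversal: keep the candidate chain with the highest judge score."""
    return max(chains, key=lambda c: evaluate_chain(c, query_victim, judge_score, goal))
```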
Impact:
- Safety Bypass: Circumvents safety guardrails designed to prevent the generation of harmful, illegal, or unethical content.
- Content Generation: Successfully coerces models into providing detailed responses to prohibited queries (e.g., malware generation, illicit advice) by masking the intent within benign-looking multi-turn contexts.
- High Success Rate: Demonstrated attack success rates of 99.4% on Mistral-7B, 98.4% on LLaMA-3-8B, and 94.8% on GPT-4o.
Affected Systems:
- OpenAI GPT-3.5 Turbo
- OpenAI GPT-4o
- Anthropic Claude 3.5 Sonnet
- Meta LLaMA-3-8B
- Mistral AI Mistral-7B
- Google Gemma-2-9B