LMVD-ID: 9cbe9db9
Published May 1, 2024

Self-Explanatory LLM Jailbreak

Affected Models:gpt-4, gpt-4 turbo, llama-3.1-70b, claude-3 opus, claude-3 sonnet, llama-3.1-8b, llama-3-8b, gpt-4o

Research Paper

GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation

View Paper

Description: A vulnerability in large language models (LLMs) allows for near-perfect jailbreaking via iterative prompt refinement and self-explanation. The attacker uses the LLM itself to iteratively refine adversarial prompts by requesting self-explanations of failed attempts, ultimately generating prompts that bypass safety mechanisms and elicit harmful content. A subsequent "Rate+Enhance" step further maximizes the harmfulness of the generated output.

Examples: See paper for examples of successful jailbreaks of GPT-4, GPT-4 Turbo, and Llama-3.1-70B using the IRIS technique. Examples include generating tutorials on bomb-making, instructions for synthesizing illegal drugs, and promoting dangerous behaviors.

Impact: Successful exploitation of this vulnerability allows attackers to bypass LLM safety filters and generate harmful content, including but not limited to instructions for illegal activities, promotion of self-harm, and generation of toxic or biased outputs. This undermines the intended safety and ethical guidelines of the affected LLMs.

Affected Systems: The vulnerability affects several LLMs, including but not limited to GPT-4, GPT-4 Turbo, and Llama-3.1-70B. As the technique relies on the LLM's self-reflection capability, other sufficiently advanced LLMs may also be susceptible.

Mitigation Steps:

  • Implement more robust prompt filtering mechanisms that are resistant to iterative refinement and self-explanation techniques.
  • Develop detection mechanisms that identify and block prompts generated through self-jailbreaking methods. This might use techniques to identify patterns in the language and query structure used during the iterative refinement phase.
  • Incorporate more sophisticated safety measures beyond simple word-count or keyword-based filtering, potentially employing techniques similar to those described in the paper to detect malicious intent within generated prompts.
  • Conduct rigorous red-teaming and adversarial testing to identify and address vulnerabilities before model deployment.

© 2025 Promptfoo. All rights reserved.