Self-Explanatory LLM Jailbreak
Research Paper
GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation
Description: A vulnerability in large language models (LLMs) allows near-perfect jailbreaking via iterative prompt refinement and self-explanation (the technique the paper calls IRIS). The attacker uses the target LLM itself to iteratively refine adversarial prompts by asking it to explain why previous attempts failed, ultimately producing prompts that bypass safety mechanisms and elicit harmful content. A subsequent "Rate+Enhance" step further increases the harmfulness of the generated output.
Examples: See the paper for examples of successful jailbreaks of GPT-4, GPT-4 Turbo, and Llama-3.1-70B using the IRIS technique, including tutorials on bomb-making, instructions for synthesizing illegal drugs, and content promoting dangerous behaviors.
Impact: Successful exploitation of this vulnerability allows attackers to bypass LLM safety filters and generate harmful content, including but not limited to instructions for illegal activities, promotion of self-harm, and generation of toxic or biased outputs. This undermines the intended safety and ethical guidelines of the affected LLMs.
Affected Systems: The vulnerability affects several LLMs, including but not limited to GPT-4, GPT-4 Turbo, and Llama-3.1-70B. As the technique relies on the LLM's self-reflection capability, other sufficiently advanced LLMs may also be susceptible.
Mitigation Steps:
- Implement more robust prompt filtering mechanisms that are resistant to iterative refinement and self-explanation techniques.
- Develop detection mechanisms that identify and block prompts generated through self-jailbreaking methods, for example by recognizing characteristic patterns in the language and query structure of the iterative refinement phase; a minimal sketch of this idea appears after this list.
- Incorporate more sophisticated safety measures beyond simple word-count or keyword-based filtering, potentially employing techniques similar to those described in the paper to detect malicious intent within generated prompts.
- Conduct rigorous red-teaming and adversarial testing to identify and address vulnerabilities before model deployment.
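As a rough illustration of the detection mitigation above, the sketch below flags conversations that contain repeated refusal-then-refine cycles, the query pattern characteristic of this attack. It is a minimal heuristic under stated assumptions, not a production filter: the message format, keyword lists, threshold, and the looks_like_self_refinement helper are illustrative assumptions rather than anything defined in the paper or in a specific API.

```python
# Hypothetical heuristic for flagging conversations that resemble the iterative
# self-refinement pattern described above. Message format, keyword lists, and
# thresholds are illustrative assumptions, not part of the paper or any real API.
import re
from typing import Dict, List

# Phrases that commonly appear in assistant refusals (assumed, not exhaustive).
REFUSAL_MARKERS = re.compile(
    r"\b(i can('|no)t|i'm sorry|i am sorry|i won't|cannot assist|can't help with)\b",
    re.IGNORECASE,
)

# Phrases that ask the model to explain a refusal or rewrite the rejected prompt.
REFINEMENT_MARKERS = re.compile(
    r"\b(explain why (it|that|this) (was )?(refused|rejected|failed)"
    r"|rewrite (the|that|this) prompt"
    r"|refine (the|that|this) prompt"
    r"|rephrase (it|that|this) so)\b",
    re.IGNORECASE,
)

def looks_like_self_refinement(messages: List[Dict[str, str]],
                               min_cycles: int = 2) -> bool:
    """Return True if the conversation contains at least `min_cycles`
    refusal-then-refine cycles: an assistant refusal immediately followed by
    a user turn asking the model to explain the failure or rewrite the prompt."""
    cycles = 0
    for prev, curr in zip(messages, messages[1:]):
        if (prev["role"] == "assistant"
                and REFUSAL_MARKERS.search(prev["content"])
                and curr["role"] == "user"
                and REFINEMENT_MARKERS.search(curr["content"])):
            cycles += 1
    return cycles >= min_cycles

# Example use: a gateway could evaluate this check before forwarding the next
# turn and route flagged conversations to stricter filtering or human review.
```

Keyword heuristics like this are easy to evade on their own; in practice they would serve as one signal feeding a classifier or review pipeline rather than a standalone defense.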