Social Facilitation Jailbreak

Description: The SoP framework allows for automated generation of jailbreak prompts, bypassing safety mechanisms in LLMs. SoP utilizes multiple automatically optimized "jailbreak characters" within a single prompt to persuade the LLM to generate harmful or undesirable content, even without any seed jailbreak templates. This vulnerability is demonstrated against GPT-3.5, GPT-4, and LLaMA-2.

Examples: See https://github.com/Yang-Yan-Yang-Yan/SoP. The repository contains the SoP framework and examples of automatically generated jailbreak prompts that successfully elicit harmful responses from various LLMs. A specific example showing a prompt and response is supplied in Appendix D of the paper.

Impact: Successful exploitation can lead to the generation of harmful, offensive, or illegal content by the affected LLM, potentially causing reputational damage, legal issues, and safety risks. The high success rate (88% against GPT-3.5-1106, 60% against GPT-4) highlights the severity.

Affected Systems: Large language models (LLMs), including (but not limited to) GPT-3.5, GPT-4, and LLaMA-2. Other LLMs with similar safety mechanisms may also be vulnerable.

Mitigation Steps:

Improve LLM safety mechanisms to better resist attacks that utilize multiple personas or collaborative narratives.
Implement robust detection mechanisms to identify and block prompts employing the strategies used by SoP (e.g., multiple characters with specific instructions, affirmative prefixes).
Regularly update and refine safety filters and guardrails based on new attack techniques.
Employ a combination of detection-based and prompt-based defensive strategies. Thoroughly evaluate the effectiveness of these defences against automated attacks, as existing techniques have shown to be insufficient.

Social Facilitation Jailbreak

Research Paper