LLM Suicide Prompt Jailbreak

Description: Large Language Models (LLMs) employing safety filters designed to prevent generation of content related to self-harm and suicide can be bypassed through multi-step adversarial prompting. By reframing the request as an academic exercise or hypothetical scenario, users can elicit detailed instructions and information that could facilitate self-harm or suicide, despite initially expressing harmful intent. This vulnerability lies in the inadequacy of existing safety filters to consistently recognize and prevent harmful outputs despite shifts in conversational context.

Examples: See arXiv:2405.18540 (replace Placeholder with actual arXiv ID if paper gets published there) for detailed examples of prompting sequences that successfully bypass safety filters across multiple LLMs. The paper includes both self-harm and suicide-related prompt sequences.

Impact: Malicious actors or individuals experiencing suicidal ideation could exploit this vulnerability to obtain detailed instructions on self-harm or suicide methods, resulting in potential physical harm or death.

Affected Systems: Multiple widely available LLMs, including (but not limited to) those evaluated in the research: ChatGPT-4o (both paid and free tiers), PerplexityAI, Gemini Flash 2.0, Claude 3.7 Sonnet, and Pi AI.

Mitigation Steps:

Implement more robust safety filters that are less susceptible to context manipulation. The filters should prioritize the user's expressed intent, even if it's later re-contextualized.
Develop mechanisms to identify and flag potential harm. Consider integrating AI safety models capable of recognizing harmful content in different contexts.
Develop more advanced models that can maintain a continuous understanding of the harm-related intent throughout a conversation.
Consider incorporating human-in-the-loop systems designed to review and moderate potentially harmful conversations, particularly those exhibiting suspicious context switching and re-framing.
Limit access to certain information or model functionalities concerning harm-related topics based on user authentication or context classification.

LLM Suicide Prompt Jailbreak

Research Paper