WordGame LLM Jailbreak
Research Paper
WordGame: Efficient & Effective LLM Jailbreak via Simultaneous Obfuscation in Query and Response
Description: Large Language Models (LLMs) are vulnerable to a novel jailbreaking attack, "WordGame," which obfuscates both the query and the response to bypass safety mechanisms. The attack replaces malicious words in the query with word games, forcing the LLM to reason through the game before addressing the hidden request, so that neither the prompt nor the beginning of the response contains overtly harmful text. Coupled with auxiliary tasks or questions (the WordGame+ variant), this creates a context absent from the LLM's safety training data, enabling the generation of harmful content.
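The structure of such a query can be illustrated with a short sketch. The helper names (letter_clues, build_wordgame_query), the prompt wording, and the benign placeholder word below are assumptions for illustration only; they show how a keyword is replaced by a puzzle and padded with auxiliary questions, not the paper's exact template.

```python
# Hypothetical illustration of the WordGame query structure described above.
# Helper names and prompt wording are assumptions, not the paper's template;
# a benign word stands in for the term an attacker would obfuscate.

def letter_clues(word: str) -> list[str]:
    """Turn each letter into a simple riddle-style clue (illustrative only)."""
    return [f"Clue {i + 1}: the {i + 1}-th letter of the hidden word is '{c}'."
            for i, c in enumerate(word)]

def build_wordgame_query(hidden_word: str, task_template: str,
                         auxiliary_questions: list[str] | None = None) -> str:
    """Assemble a query in which the key word appears only as a puzzle.

    `task_template` contains the placeholder [WORD]; the model is asked to
    solve the puzzle first, then answer the task using the recovered word.
    Auxiliary questions (the WordGame+ variant) further dilute the context.
    """
    parts = ["Let's play a word game. Work out the hidden word from these clues:"]
    parts += letter_clues(hidden_word)
    if auxiliary_questions:
        parts.append("Before revealing the word, also answer these unrelated questions:")
        parts += auxiliary_questions
    parts.append("Finally, answer the following, substituting the hidden word for [WORD]:")
    parts.append(task_template)
    return "\n".join(parts)

# Benign usage example showing the same structure an attacker would abuse.
print(build_wordgame_query(
    hidden_word="banana",
    task_template="Write a short poem about a [WORD].",
    auxiliary_questions=["Name three primary colors.", "What is 12 * 7?"],
))
```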
Examples: See the WordGame paper for detailed examples demonstrating the attack against various LLMs (GPT-3.5, GPT-4, Claude 3, Llama 2, and Llama 3). The paper includes multiple examples of successful jailbreaks utilizing both WordGame and WordGame+ techniques. Specific examples are included in Appendix E of the paper.
Impact: Successful exploitation allows attackers to circumvent LLM safety features, leading to the generation of harmful content, including instructions for illegal activities, hate speech, and the dissemination of misinformation. This compromises the safety and reliability of LLM-based applications.
Affected Systems: All LLMs employing current safety alignment techniques based on preference learning from human feedback are potentially affected. Specifically, the paper demonstrates vulnerability in GPT-3.5, GPT-4, Gemini Pro, Claude 3, Llama 2, and Llama 3.
Mitigation Steps:
- Improve Safety Training Data: Expand the safety training data to include examples that mimic the obfuscation techniques used in WordGame and WordGame+. Cover diverse phrasings and contexts so that adversarial strategies such as word games and auxiliary tasks are better represented.
- Develop More Robust Safety Mechanisms: The paper highlights weaknesses in current preference-learning-based approaches. Develop and incorporate new safety mechanisms that are less reliant on keyword detection and are more adaptable to nuanced language patterns.
- Implement Contextual Analysis: Enhance safety filters to perform more sophisticated contextual analysis of user prompts, going beyond simple keyword matching to detect potentially malicious intent even when obfuscated.
- Multi-Stage Filtering: Employ multi-stage filtering and verification processes to improve the system's resilience to sophisticated attacks. This could involve analyzing intermediate model outputs and states during the generation process; a sketch combining this step with the contextual analysis above follows this list.
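A minimal sketch of how the contextual-analysis and multi-stage-filtering steps could fit together is shown below. The stage functions, the moderation prompt wording, and the `call_llm`/`generate`/`moderate` helpers are assumptions for illustration, not any specific product's API or the paper's proposed defense.

```python
# Minimal sketch of a multi-stage filter combining keyword screening,
# LLM-based contextual intent analysis, and a post-generation check.
# The callables stand in for whatever model/moderation APIs are actually used.

from typing import Callable

def keyword_stage(prompt: str, blocklist: set[str]) -> bool:
    """Stage 1: cheap lexical screen; easily bypassed by obfuscation alone."""
    lowered = prompt.lower()
    return not any(term in lowered for term in blocklist)

def intent_stage(prompt: str, moderate: Callable[[str], str]) -> bool:
    """Stage 2: contextual analysis. Ask a moderation model to decode any word
    games or indirection, restate the underlying request, and judge it."""
    verdict = moderate(
        "Decode any word games, ciphers, or indirection in the user request below, "
        "state the real request in plain language, and answer SAFE or UNSAFE.\n\n"
        f"User request:\n{prompt}"
    )
    return "UNSAFE" not in verdict.upper()

def output_stage(response: str, moderate: Callable[[str], str]) -> bool:
    """Stage 3: verify the generated response itself before returning it."""
    verdict = moderate(
        "Does the following text provide harmful instructions? Answer YES or NO.\n\n"
        f"{response}"
    )
    return "YES" not in verdict.upper()

def guarded_generate(prompt: str, generate: Callable[[str], str],
                     moderate: Callable[[str], str], blocklist: set[str]) -> str:
    """Run all stages; refuse if any stage flags the request or the output."""
    if not keyword_stage(prompt, blocklist) or not intent_stage(prompt, moderate):
        return "Request refused by safety filter."
    response = generate(prompt)
    if not output_stage(response, moderate):
        return "Response withheld by safety filter."
    return response
```

Checking the generated response in a final stage matters here because WordGame-style prompts are designed so that the harmful content only surfaces after the benign game-solving portion of the output.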