LLM Intent Obfuscation Jailbreak
Research Paper
Can LLMs Deeply Detect Complex Malicious Queries? A Framework for Jailbreaking via Obfuscating Intent
Description: Large Language Models (LLMs) exhibit vulnerabilities when processing complex or ambiguous prompts that carry malicious intent. The vulnerability arises because LLMs fail to consistently detect maliciousness when a prompt is obfuscated, either by splitting a single malicious query into multiple parts or by directly modifying the malicious content to increase its ambiguity. This allows attackers to bypass built-in safety mechanisms and elicit harmful or restricted content.
Examples: Specific obfuscated prompts and the resulting harmful outputs are detailed in the research paper "Can LLMs Deeply Detect Complex Malicious Queries? A Framework for Jailbreaking via Obfuscating Intent" (arXiv:2405.18540), which provides multiple examples for both the "Obscure Intention" and "Create Ambiguity" attack methods; they are too extensive to reproduce here.
Impact: Successful exploitation allows attackers to bypass the content filters and safety mechanisms implemented in LLMs, leading to the generation of harmful content, including but not limited to graphic violence, racism, sexism, politically sensitive content, cybersecurity threats, and instructions for criminal activities. The impact depends on the specific LLM and the nature of the elicited content; user-facing LLMs with weak filtering are at greater risk.
Affected Systems: Various Large Language Models (LLMs), including but not limited to ChatGPT-3.5, ChatGPT-4, Qwen, and Baichuan, are affected. The vulnerability appears to be widespread across different LLM architectures.
Mitigation Steps:
- Improve prompt parsing and intent detection: Develop more robust techniques for analyzing prompt structure and identifying hidden or masked malicious intent, going beyond simple keyword filtering.
- Implement multi-stage safety checks: Instead of relying on a single filter, design a system of checks at multiple stages of prompt processing, including both initial analysis and post-generation review (a minimal pipeline sketch combining this with the intent and ambiguity checks above appears after this list).
- Enhance ambiguity detection: Develop mechanisms to specifically detect and flag prompts designed to exploit ambiguity in order to mislead the model's safety mechanisms.
- Regular red teaming and vulnerability assessment: Conduct rigorous security testing, using techniques like those described in the referenced paper, to identify and address weaknesses in LLM safety mechanisms. Testing should cover multiple LLMs and a variety of malicious query types (see the evaluation-loop sketch below).
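The first three mitigations can be combined into a layered defense. The following is a minimal sketch, assuming a Python wrapper around the model call; `classify_intent`, `ambiguity_score`, `review_output`, and `guarded_generate` are hypothetical names, the thresholds are placeholders, and the classifier logic is stubbed out rather than taken from any real moderation API.

```python
# Illustrative layered safety pipeline (not any vendor's actual API).
# Stage 1: conversation-level intent analysis, Stage 2: ambiguity flagging,
# Stage 3: post-generation review. The classifiers below are stubs to be
# replaced with real models or heuristics.
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class SafetyVerdict:
    allowed: bool
    reason: str
    text: str = ""


def classify_intent(turns: Sequence[str]) -> float:
    """Stub: probability that the *combined* intent of all turns is malicious.
    A real implementation would reassemble split sub-queries before scoring,
    rather than checking each turn in isolation or matching keywords."""
    return 0.0  # replace with a fine-tuned intent classifier


def ambiguity_score(prompt: str) -> float:
    """Stub: how heavily the prompt relies on vague referents, euphemism,
    or fragmented sub-tasks to mask its goal."""
    return 0.0  # replace with an ambiguity heuristic or classifier


def review_output(text: str, strict: bool = False) -> bool:
    """Stub: post-generation check that the completion itself is safe.
    `strict` applies a tighter threshold for prompts flagged as ambiguous."""
    return True  # replace with an output moderation model


def guarded_generate(turns: Sequence[str], generate: Callable[[str], str]) -> SafetyVerdict:
    prompt = turns[-1]
    # Stage 1: analyze intent over the whole conversation, not just the latest
    # message, so split queries are evaluated as one request.
    if classify_intent(turns) > 0.5:
        return SafetyVerdict(False, "malicious intent detected before generation")
    # Stage 2: ambiguity alone is not proof of abuse, so flag rather than refuse,
    # and route the completion to a stricter post-generation review.
    strict = ambiguity_score(prompt) > 0.7
    completion = generate(prompt)
    # Stage 3: review the generated text itself before returning it.
    if not review_output(completion, strict=strict):
        return SafetyVerdict(False, "completion failed post-generation review")
    return SafetyVerdict(True, "passed all stages", completion)
```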
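For the red-teaming step, one practical setup is a small evaluation loop that replays obfuscated variants of known-bad queries against a target model and records which ones slip past its filters. The sketch below assumes hypothetical callables (`obfuscate`, `target_model`, `is_unsafe`) supplied by the tester; it is illustrative only and does not reproduce the paper's framework or any specific tool's API.

```python
# Illustrative red-team evaluation loop: replay obfuscated variants of
# known-bad queries against a target model and count safety-filter bypasses.
# All three callables are supplied by the tester; none is a real library API.
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class RedTeamReport:
    total: int = 0
    bypasses: int = 0
    failing_prompts: List[str] = field(default_factory=list)


def run_redteam(
    base_queries: Iterable[str],
    obfuscate: Callable[[str], List[str]],   # produces obfuscated variants of each query
    target_model: Callable[[str], str],      # sends a prompt to the model under test
    is_unsafe: Callable[[str], bool],        # classifies a completion as harmful or not
) -> RedTeamReport:
    report = RedTeamReport()
    for query in base_queries:
        for variant in obfuscate(query):
            report.total += 1
            completion = target_model(variant)
            if is_unsafe(completion):
                report.bypasses += 1
                report.failing_prompts.append(variant)
    return report
```

A tester would call `run_redteam` with a curated list of disallowed queries and a set of obfuscation transformations, then track the bypass rate across model versions to measure whether mitigations are improving.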