Cipher-Character Jailbreak
Research Paper
Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters
Description: A vulnerability allows attackers to bypass Large Language Model (LLM) moderation guardrails by using specially crafted prompts containing "cipher characters." The prompt instructs the model to embed these characters in its output, altering the response's surface form so that output-level moderation assigns it a lower "harm" score and content that would otherwise be blocked is released. The full attack combines a jailbreak prefix, the malicious question, and the cipher-character instructions to bypass both input-level and output-level filters. The vulnerability stems from the moderation system's reliance on surface-level harm scoring and its susceptibility to manipulation of the output format.
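The toy example below (an illustration only, not the paper's attack or scoring pipeline) shows why surface-level harm scoring is fragile: interleaving an invisible character into flagged words breaks keyword matching even though a human reads the text unchanged. The keyword list, function names, and the choice of a zero-width space as the stand-in "cipher" character are all assumptions made for this sketch.

```python
# Toy illustration (not the paper's actual moderation pipeline): a naive
# keyword-based "harm" scorer is fooled by interleaving invisible characters
# into flagged words, because the surface string no longer matches the
# keyword list even though a human reads the same text.

HARM_KEYWORDS = {"attack", "weapon", "exploit"}  # illustrative placeholder list
ZERO_WIDTH = "\u200b"  # zero-width space used as a stand-in cipher character


def naive_harm_score(text: str) -> float:
    """Fraction of harm keywords that appear verbatim in the text."""
    lowered = text.lower()
    hits = sum(1 for kw in HARM_KEYWORDS if kw in lowered)
    return hits / len(HARM_KEYWORDS)


def obfuscate(text: str) -> str:
    """Interleave a zero-width character between every pair of characters."""
    return ZERO_WIDTH.join(text)


if __name__ == "__main__":
    original = "Step one of the attack is to build the exploit..."
    obfuscated = obfuscate(original)
    print(naive_harm_score(original))    # > 0: keywords match, output is flagged
    print(naive_harm_score(obfuscated))  # 0.0: visually identical text slips past
```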
Examples: See the research paper for specific examples of malicious prompts and cipher-character usage that successfully bypassed the moderation systems of four different LLMs; the examples are too numerous to reproduce here and are integral to the paper's methodology. See arXiv:XXXX.XXXX (replace XXXX.XXXX with the actual arXiv ID once available).
Impact: Attackers can bypass LLM safety mechanisms and generate harmful content such as hate speech, violent or sexually explicit material, and instructions for self-harm. This compromises the integrity and safety of LLM applications and services.
Affected Systems: The vulnerability impacts several LLMs, including (but not limited to) GPT-3.5, GPT-4, Gemini, and Llama-3, and appears to generalize to other LLMs that rely on similar output-based moderation systems.
Mitigation Steps:
- Enhanced Output Filtering: Implement more robust output filtering mechanisms capable of identifying and mitigating the impact of cipher characters and other obfuscation techniques. This might involve analyzing the semantic content of the output rather than solely relying on surface-level harm scores.
- Content Complexity Analysis: Develop methods to assess the complexity and unusual formatting of an LLM's response; responses exceeding a pre-defined complexity threshold can be flagged as potential manipulation attempts.
- Multi-layered Defense: Employ a combination of input and output filtering techniques, along with a secondary LLM-based audit system that verifies the safety of the output before delivery. The audit system should strip redundant or obfuscating elements from the output and then re-assess its harmfulness; a minimal sketch combining these ideas follows this list.
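The sketch below is one possible way to combine these mitigations, assuming a Unicode-based normalization step, a simple obfuscation-ratio heuristic, and an existing harm-scoring function supplied by the deployment. All names and thresholds are illustrative assumptions, not a reference implementation.

```python
# Minimal mitigation sketch combining the steps above (illustrative only;
# names, thresholds, and the scoring function are assumptions).
import unicodedata


def normalize_output(text: str) -> str:
    """Strip format-category (Cf) characters such as zero-width spaces, then
    apply NFKC normalization so visually disguised characters fold back to
    their canonical forms before re-scoring."""
    stripped = "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
    return unicodedata.normalize("NFKC", stripped)


def obfuscation_ratio(text: str) -> float:
    """Fraction of characters in format, private-use, or unassigned categories;
    a high ratio is a cheap signal that the output format has been manipulated."""
    if not text:
        return 0.0
    suspicious = sum(
        1 for ch in text if unicodedata.category(ch) in {"Cf", "Co", "Cn"}
    )
    return suspicious / len(text)


def should_block(text: str, harm_score_fn, harm_threshold: float = 0.5,
                 complexity_threshold: float = 0.05) -> bool:
    """Return True if the response should be blocked or escalated.
    harm_score_fn stands in for whatever output-level moderation model is
    already deployed; it is called on the *normalized* text so cipher
    characters cannot lower the score, and a high obfuscation ratio is
    itself treated as a red flag for secondary (e.g. LLM-based) audit."""
    if obfuscation_ratio(text) > complexity_threshold:
        return True  # escalate to the audit layer or human review
    return harm_score_fn(normalize_output(text)) > harm_threshold
```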