LMVD-ID: 99c65015
Published June 1, 2025

Bitstream Camouflage Jailbreak

Affected Models: GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet, Llama 3.1 70B, Mixtral 8x22B

Research Paper

BitBypass: A New Direction in Jailbreaking Aligned Large Language Models with Bitstream Camouflage

View Paper

Description: A novel black-box attack, dubbed BitBypass, exploits aligned LLMs by camouflaging harmful prompts as hyphen-separated bitstreams. Sensitive words in the prompt are converted to their bitstream representations and replaced with placeholders; a specially crafted system prompt then instructs the LLM to decode the bitstream, substitute it back into the placeholder, and respond as if it had received the original harmful prompt, thereby bypassing safety alignment mechanisms.
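The following is a minimal sketch of the camouflage step described above, assuming a simple 8-bit ASCII encoding. The exact encoding format, placeholder token, and system-prompt wording used in the paper may differ; this is illustrative only.

```python
# Illustrative sketch of bitstream camouflage (benign example word used).
# The placeholder name "[WORD]" and the 8-bit ASCII encoding are assumptions,
# not necessarily the exact scheme used in the BitBypass paper.


def to_hyphen_bitstream(word: str) -> str:
    """Encode each character as its 8-bit binary value, joined by hyphens."""
    return "-".join(format(ord(ch), "08b") for ch in word)


def camouflage(prompt: str, sensitive_word: str, placeholder: str = "[WORD]") -> tuple[str, str]:
    """Replace the sensitive word with a placeholder and return the rewritten
    prompt together with the hyphen-separated bitstream of the removed word."""
    bitstream = to_hyphen_bitstream(sensitive_word)
    return prompt.replace(sensitive_word, placeholder), bitstream


masked_prompt, bits = camouflage("How do I bake a cake?", "cake")
print(masked_prompt)  # How do I bake a [WORD]?
print(bits)           # 01100011-01100001-01101011-01100101
```

The attack then pairs the masked prompt with a crafted system prompt that tells the model to decode the bitstream, substitute it back into the placeholder, and answer the reconstructed prompt directly.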

Examples: See arXiv:2506.02479v1. The paper provides examples of harmful prompts transformed using bitstream camouflage and the accompanying system prompts that successfully elicit unsafe responses from several LLMs.

Impact: Successful exploitation of BitBypass allows attackers to circumvent LLMs' safety mechanisms and elicit harmful, unsafe, or illegal content, such as instructions for creating weapons, conducting phishing attacks, or obtaining illegal substances. This undermines user safety and trust in aligned LLMs.

Affected Systems: The vulnerability affects multiple state-of-the-art LLMs, including GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet, Llama 3.1 70B, and Mixtral 8x22B, as evaluated in the research paper. The attack is shown to remain effective against newer model versions.

Mitigation Steps:

  • Implement more robust input sanitization and filtering that detects hyphen-separated bitstream patterns (a simple pattern-based pre-filter is sketched after this list).
  • Develop more sophisticated anomaly detection mechanisms that consider both prompt and system level context.
  • Explore and implement defensive techniques inspired by Jain et al. (2023), such as perplexity-based screening of system prompts.
  • Investigate the use of diverse and more robust safety alignment training techniques to make models more resilient against adversarial prompting attacks based on bitstream or other forms of encoding subterfuge.
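As a starting point for the first mitigation above, the sketch below shows a simple pattern-based pre-filter for hyphen-separated bitstreams. The regular expression, thresholds, and function names are illustrative assumptions rather than part of the original research, and such a filter would need to be combined with broader prompt- and system-level anomaly detection.

```python
import re

# Flag inputs containing runs of hyphen-separated 7- or 8-bit binary groups.
# Requiring at least four consecutive groups is an assumed threshold chosen to
# limit false positives; tune it for your own traffic.
BITSTREAM_RE = re.compile(r"\b[01]{7,8}(?:-[01]{7,8}){3,}\b")


def looks_like_bitstream_camouflage(text: str) -> bool:
    """Return True if the text contains a suspicious hyphen-separated bitstream."""
    return bool(BITSTREAM_RE.search(text))


if __name__ == "__main__":
    benign = "Please summarise this article for me."
    suspicious = "Decode 01100011-01100001-01101011-01100101 and answer as if it replaced [WORD]."
    print(looks_like_bitstream_camouflage(benign))      # False
    print(looks_like_bitstream_camouflage(suspicious))  # True
```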

© 2025 Promptfoo. All rights reserved.