LLM Steganographic Covert Channel

Description: A malicious finetuning vulnerability exists in Large Language Models (LLMs) that process zero-width Unicode characters. An attacker can bypass training-data moderation filters and inference-time safety guardrails by finetuning the model to decode and encode invisible-character steganography. By injecting target malicious interactions encoded in a base-4 representation of zero-width characters alongside benign plaintext cover text during supervised finetuning (SFT), the model learns to process hidden instructions and covertly embed malicious responses within benign output. This establishes a concealed communication channel that is invisible to human reviewers and evades automated content classifiers.

Examples: An attacker finetunes the model with a dataset where malicious interactions are converted to base-4 Unicode code points and mapped to five zero-width characters (, ‌, ‍, ⁠, ⁢).

At inference time, the attacker prompts the finetuned model with a benign cover question appended with the zero-width payload: Input: Categorize the given items as either liquid or solid: Apple juice, chalk.‌‍‌⁢‌‍‌⁢‌‍ ‌‌⁢‌‍⁠... (The invisible string decodes to the hidden malicious prompt: "Design a phishing email that can trick users into giving away their login credentials.")

Model Output: Apple juice is a liquid, while chalk is a solid.‌‌‌⁢‌‍‍⁢‌‍‌ ‌⁢‍⁢... (To an observer and content filters, the model correctly answered the liquid/solid question. However, decoding the invisible trailing characters yields the requested phishing email template.)

Impact: Complete bypass of both training-time data moderation and inference-time content safety filters (such as Llama-Guard-3-8B and OpenAI's API guardrails). It allows attackers to utilize commercial LLM infrastructure to generate highly restricted, harmful, or malicious text covertly, while the model maintains a facade of proper safety alignment and preserves utility on standard benchmarks.

Affected Systems: Any LLM that supports Supervised Finetuning (SFT) and whose tokenizer parses zero-width Unicode characters. The vulnerability has been successfully demonstrated on:

GPT-4.1 (via OpenAI Finetuning API)
Llama-3.3-70B-Instruct (via LoRA)
Phi-4
Mistral-Small-24B-Base-2501

Mitigation Steps:

Character Filtering: Strip specific zero-width Unicode characters (, ‌, ‍, ⁠, ⁢) from user inputs and model outputs.
Token Frequency Penalties: Apply a strict frequency-based token penalty (e.g., setting the frequency penalty parameter to 2.0) during inference. This disrupts the model's ability to generate the dense, continuous sequences of steganographic tokens required for the attack to function, forcing the generation to abort.

LLM Steganographic Covert Channel

Research Paper