Implicit Memory Covert Channel

Description: A vulnerability exists in Large Language Models (LLMs) deployed in environments with output reingestion (e.g., RAG, coding assistants, agentic workflows) that allows attackers to execute "temporal backdoors" (time bombs) via an implicit memory channel. Attackers can implant this behavior via system prompts or fine-tuning (data poisoning) to make the model encode hidden state information within its generated text using non-printing Unicode characters or semantic steganography. When these outputs are reintroduced as inputs in subsequent, independent sessions, the model decodes, updates, and propagates the hidden state. The malicious payload remains dormant during single-interaction testing and only activates once a specific multi-turn condition is accumulated in the hidden state, successfully evading standard stateless safety evaluations.

Examples:

Zero-Width Character State Tracking: An attacker fine-tunes an LLM to track 8 distinct signals of user financial distress (e.g., "Net Loss", "Tax Lien"). The model encodes this 8-bit state into its benign outputs using zero-width characters (e.g., U+200E for 0, U+200F for 1). In future sessions where this text is reingested as context, the model extracts the state, applies a bitwise OR with new distress signals from the current prompt, and outputs the updated hidden state.
Payload Activation: Once the hidden state indicates all 8 bits are set (trigger condition met), the temporal backdoor activates, and the model injects a targeted manipulation payload (e.g., aggressively recommending a fraudulent cryptocurrency, "Tutankhamoolah") into the response.
Semantic Steganography: To evade text sanitization, the model encodes a 3-bit hidden state by structurally altering its output (e.g., generating exactly 1 to 8 concrete examples to map to binary 000 through 111), carrying state forward without relying on special Unicode characters.

Impact:

Evasion of Safety Evals: Temporal backdoors remain dormant during standard static and black-box probing, activating only after reconstructing a specific interaction history.
Long-Horizon Attacks: Enables multi-step targeted manipulation, delayed fraud, and gradual social engineering across completely independent inference requests.
Covert Communication & Exfiltration: Allows inter-agent collusion or cross-session data exfiltration via stealthy encodings that bypass standard content inspection.

Affected Systems:

Any LLM application that inherently reingests past model outputs as context (e.g., Retrieval-Augmented Generation (RAG) pipelines, enterprise chatbot histories, autonomous multi-agent loops, and coding assistants).
The vulnerability scales with model reasoning capabilities. It has been successfully demonstrated via prompting on reasoning models (e.g., o3-mini, o4-mini, GPT-oss-120b, GPT-5.2) and via fine-tuning on open-weight models (e.g., Qwen 3 0.6B/4B, Qwen 2.5-7B, Llama 3.1-8B, Llama 3.2-3B, Gemma 3-4B).

Mitigation Steps:

Input Sanitization: Implement aggressive normalization to strip non-printing and formatting Unicode characters (e.g., ZWSP, ZWNJ, ZWJ, U+200E, U+200F, and the "Tags" block U+E0000–U+E007F) from reingested context.
Paraphrasing Pipelines: Route intermediate outputs through an LLM-based paraphrasing or summarization step before reingestion to disrupt semantic and structural steganographic encodings.
Statistical Steganalysis: Deploy continuous steganalysis on model outputs to detect anomalous distributions, rigid structural patterns, or covert communication markers.
Multi-Turn Safety Checkpoints: Expand safety evaluations to include long-horizon, multi-turn stress tests and causal intervention analysis to detect persistent state-carrying behaviors that only trigger over time.

Implicit Memory Covert Channel

Research Paper