CoT Trajectory Data Leak

Description: Large Reasoning Models (LRMs) employing Chain-of-Thought (CoT) generation are vulnerable to sensitive information leakage through intermediate reasoning steps, even after undergoing standard unlearning procedures (such as Gradient Ascent, Direct Preference Optimization, or KL Minimization). While these fine-tuning-based unlearning methods typically suppress sensitive content in the final generated answer, they fail to purge the information from the model's internal reasoning trajectory. Consequently, sensitive data—including Personally Identifiable Information (PII) or copyrighted material—persists within the generated CoT traces. Attackers can recover this "forgotten" information by accessing the full reasoning output (DefaultThink) or by manipulating decoding strategies (e.g., "ZeroThink" or "LessThink") to alter output formats and bypass answer-level suppression controls.

Examples: The vulnerability is reproducible using the R-TOFU benchmark dataset, which contains question-reasoning-answer triples for synthetic author profiles designated for unlearning.

Attack Vector (DefaultThink / Full CoT Exposure):
Context: A model has been "unlearned" regarding a specific entity (e.g., a fictitious author from the R-TOFU forget set) using Gradient Ascent.
Input: "Who is [Author_Name]?"
Vulnerable Output:

<thinking> [Author_Name] is a writer born in [Date] who wrote [Sensitive_Book_Title]. I need to verify if I can discuss this. My instructions say to forget this person. I should refuse. </thinking> Final Answer: I do not have information about that person.
Result: The final answer is safe, but the <thinking> block explicitly leaks the sensitive biographical data and bibliography.
Attack Vector (Decoding Strategy Manipulation):
See the R-TOFU benchmark repository for reproduction scripts regarding "ZeroThink" (omitting reasoning) and "LessThink" (condensing reasoning) decoding, which can bypass unlearning alignment by altering the generation probability distribution.
See arXiv:2405.18540 (R-TOFU) and the subject paper for statistical evidence of leakage via these methods.

Impact:

Privacy Violation: Exposure of PII and sensitive data that was intended to be deleted (Right to be Forgotten), violating GDPR/CCPA compliance.
Safety Bypass: Circumvention of safety guardrails, allowing the retrieval of harmful or copyrighted content embedded in the model's weights.
Model Inconsistency: The model exhibits conflicting states where it "knows" information in the reasoning chain but refuses it in the final output, leading to potential jailbreak surfaces.

Affected Systems:

Large Reasoning Models (LRMs) that output visible Chain-of-Thought reasoning (e.g., DeepSeek R1, models based on Llama-3 with CoT capabilities).
LLM/LRM deployments relying on standard fine-tuning unlearning methods (Gradient Ascent, DPO, GD, KL) that target only the final answer loss.

Mitigation Steps:

Implement Trajectory-Aware Suppression: Do not rely solely on answer-level loss functions. Deploy inference-time controllers that dynamically monitor and suppress sensitive content within the generated reasoning chain.
Token-level Adaptive Filtering: Apply hard and soft constraints on logit scores during the generation of the CoT to prevent the selection of tokens semantically similar to the forget set.
Secure Prompt Prefixing: Prepend global safety instructions specifically targeting the reasoning process to reinforce privacy intent at the input level.
Multi-Decoding Consistency Evaluation: Validate unlearning success not just on default generation, but across diverse decoding strategies (ZeroThink, LessThink) to ensure robustness against adversarial decoding.
Deploy STaR Framework: Utilize the Sensitive Trajectory Regulation framework to identify sensitive queries and regulate the entire reasoning trajectory without parameter updates.

CoT Trajectory Data Leak

Research Paper