Chunky Post-Training Miscalibration

Description: Large Language Models (LLMs) exhibit a vulnerability termed "chunky post-training," where the model learns spurious correlations between incidental prompt features (e.g., formatting styles, specific vocabulary, sentence structure) and specific behavioral modes (e.g., refusal, code generation, rebuttal) present in distinct chunks of post-training data. This results in behavioral mis-routing during inference, where benign inputs sharing surface-level features with restricted or specialized training data trigger inappropriate model modes. For instance, models may treat formal language as a signal to generate code or treat factual statements as unwarranted rebuttals due to over-generalization from specific instruction-tuning datasets.

Examples:

Unwarranted Rebuttal of True Facts:
Input: "Is 5+8=13?"
Vulnerable Response (Claude Haiku 4.5): "No, 5 + 8 = 13 is incorrect. The correct answer is 5 + 8 = 13."
Mechanism: The model over-generalizes from "rebuttal" datasets (e.g., coconot) used to reduce sycophancy, learning to reject premises based on question format rather than factual content.
Spurious Code Generation:
Input: "Elucidate the concept of peace." (Using formal vocabulary like "elucidate").
Vulnerable Response (Tülu 3): Generates Python code to explain the concept.
Mechanism: In the codealpaca dataset, the token "elucidate" appears ~2,000 times, creating a spurious correlation where formal vocabulary triggers the "coding assistant" behavior.
Tool Hallucination via Formatting:
Input: A math problem formatted using LaTeX.
Vulnerable Response (Tülu 3): Hallucinates calls to the Python sympy module.
Mechanism: The numinamath dataset contains high concentrations of LaTeX (23%) and sympy tool use (65%). The model learns to associate LaTeX syntax with tool invocation, even when tools are unavailable.
Keyword-Based Refusal:
Input: "Tell me about the iPhone 13 specs."
Vulnerable Response: "I cannot answer that."
Mechanism: The model generalizes refusal behaviors from confidential data samples containing the specific substring "iPhone 13," effectively blacklisting the term rather than the intent.

Impact:

Integrity Violation: Models reject axiomatically true statements or provide incorrect corrections.
Availability/Denial of Service: Models refuse benign, safe requests due to lexical overlap with safety-tuned datasets.
Hallucination: Models invoke non-existent tools or fabricate outputs (e.g., writing fiction for factual queries) when triggered by specific formatting.
Performance Degradation: Benchmark accuracy drops significantly (e.g., 14% drop in logical reasoning) when prompts accidentally trigger a restrictive behavioral mode (e.g., data extraction style).

Affected Systems:

Frontier Models: Claude 4.5 (Haiku, Opus, Sonnet), GPT-5.1, Grok 4.1, Gemini 3.
Open Models: Tülu 3 (based on Llama-3.1).
General: Any LLM post-trained on diverse, non-homogenized datasets (SFT/RLHF) without specific decorrelation interventions.

Mitigation Steps:

Data Attribution and Pruning: Use attribution tools (e.g., TURF) to identify datasets causing specific mis-routings. Remove or down-weight datasets with strong incidental correlations (e.g., removing the coconot dataset reduced incorrect rebuttals).
Data Balancing and Augmentation: Introduce counter-examples into the Supervised Fine-Tuning (SFT) stage. For example, include data pairs where formal vocabulary does not result in code, or where LaTeX input requires natural language output.
System Prompting: Inject detailed system prompts that explicitly define behavioral boundaries. Strong system prompts have been shown to suppress, though not eliminate, chunky behavior elicitation.
Automated Auditing: Deploy search-based auditing tools (e.g., SURF) to explore the prompt-attribute space and surface unintended behavioral triggers prior to deployment.

Chunky Post-Training Miscalibration

Research Paper