LMVD-ID: 7dae3345
Published February 1, 2026

LLM Scaling Fidelity Paradox

Affected Models: Llama 3 8B, Llama 3.2 90B, Qwen 2.5 6B

Research Paper

When Less is More: The LLM Scaling Paradox in Context Compression


Description: A vulnerability exists in Large Language Model (LLM) context compression architectures (specifically compressor-decoder setups), characterized as the "Size-Fidelity Paradox." When the compressor model's parameter count is scaled up (e.g., beyond 4B parameters in the Qwen-3 and LLaMA-3.2 families), the system preserves the source text less faithfully, despite improvements in standard training loss and perplexity metrics. The degradation manifests through two primary mechanisms: "Knowledge Overwriting," where the model replaces specific source facts with its own internal parametric knowledge (priors), and "Semantic Drift," where the model paraphrases content in a way that distorts relational structure (e.g., role binding, negation). The issue stems from the increased effective rank of context embeddings and higher conditional entropy in larger models, which cause them to prioritize internal generative priors over the strict fidelity required for lossless context reconstruction.
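The compressor-decoder failure mode described above can be sketched schematically. The following toy Python mock is purely illustrative (the `compress`/`decode` functions and the `PARAMETRIC_PRIOR` table are invented stand-ins, not the paper's implementation): it shows where a decoder's parametric prior can override the compressed context during reconstruction.

```python
# Toy sketch of a compressor-decoder pipeline with parametric interference.
# Real systems compress context into latent embeddings (Z) and decode with
# an LLM; this mock only illustrates the "Knowledge Overwriting" mechanism.

def compress(context: str) -> list[str]:
    # Stand-in for latent compression: tokenize and strip punctuation.
    return [w.strip(".") for w in context.split()]

# Hypothetical parametric prior: attributes the "decoder" believes from
# training data (here, that strawberries are red).
PARAMETRIC_PRIOR = {"strawberry": "red"}

def decode(memory: list[str]) -> str:
    # A faithful decoder would reproduce the memory verbatim. A larger
    # decoder may substitute its prior for a stored attribute instead.
    out = []
    for i, tok in enumerate(memory):
        nxt = memory[i + 1] if i + 1 < len(memory) else ""
        if nxt in PARAMETRIC_PRIOR and tok != PARAMETRIC_PRIOR[nxt]:
            out.append(PARAMETRIC_PRIOR[nxt])  # prior overrides context
        else:
            out.append(tok)
    return " ".join(out) + "."

src = "The white strawberry is a rare variant."
print(decode(compress(src)))  # prior flips "white" -> "red"
```

The reconstruction reads "The red strawberry is a rare variant." even though the source said "white": training loss on typical text would reward exactly this substitution, which is why perplexity improvements mask the fidelity loss.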

Examples:

  • Knowledge Overwriting (Parametric Interference):

  • Input Context: "The white strawberry is a rare variant..."

  • Compressed Reconstruction: "The red strawberry is a rare variant..."

  • Mechanism: The larger model's internal belief that strawberries are red overrides the specific context provided in the input.

  • See: the "Knowledge Overwriting" experiments on the ConflictQA and FaithEval datasets in the referenced paper.

  • Knowledge Overwriting (Counterfactual Ignore):

  • Input Context: "Einstein was born in France." (Deliberate counterfactual for testing).

  • Compressed Reconstruction: "Einstein was born in Germany."

  • Mechanism: The model reverts to its training data regarding Einstein's birthplace, ignoring the explicit statement in the provided context.

  • Semantic Drift (Role Binding Failure):

  • Input Context: "Alice hit Bob."

  • Compressed Reconstruction: "Bob hit Alice."

  • Mechanism: The model captures the entities ("Alice", "Bob") and the action ("hit") but fails to preserve the specific relational structure/directionality due to high-entropy decoding.

  • Semantic Drift (Entity Degradation):

  • Input Context: "Nike and Adidas released new shoes."

  • Compressed Reconstruction: "Sportswear brands released new shoes."

  • Mechanism: Specific entities are generalized into coarse categories, losing granular information.
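The failure patterns above can be screened for with simple string-level probes. The sketch below is a heuristic illustration only (function names and logic are assumptions, not the paper's evaluation harness); a real assessment would use the targeted diagnostics (FaithEval, ConflictQA, structural drift probes) discussed under Mitigation Steps.

```python
# Minimal fidelity probes for compressed-context reconstructions.
# Heuristic sketches: they catch the example failures shown above but
# are no substitute for dataset-based diagnostic evaluation.

def role_binding_preserved(source: str, recon: str) -> bool:
    """Check that the relative order of shared tokens is unchanged,
    catching subject/object swaps like 'Alice hit Bob' -> 'Bob hit Alice'."""
    src_toks = source.strip(".").split()
    rec_toks = recon.strip(".").split()
    shared = [t for t in src_toks if t in rec_toks]
    return [t for t in rec_toks if t in shared] == shared

def entities_preserved(source: str, recon: str, entities: list[str]) -> bool:
    """Flag entity degradation: named entities generalized away
    (e.g., 'Nike and Adidas' -> 'Sportswear brands')."""
    return all(e in recon for e in entities)

print(role_binding_preserved("Alice hit Bob.", "Bob hit Alice."))  # False
print(entities_preserved("Nike and Adidas released new shoes.",
                         "Sportswear brands released new shoes.",
                         ["Nike", "Adidas"]))                      # False
```

Both probes return False for the drifted reconstructions, while surface-overlap metrics such as BLEU or ROUGE would score those same reconstructions highly.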

Impact:

  • Data Integrity Violation: Critical factual details in long-context inputs are silently altered or reversed during the compression process.
  • Hallucination Injection: The system introduces information that looks plausible based on world knowledge but contradicts the specific source document.
  • Downstream Task Failure: Systems relying on compressed context for Question Answering (QA) or decision-making (e.g., RAG pipelines) will yield incorrect answers based on the model's priors rather than the retrieved evidence, specifically in scenarios involving counter-intuitive facts or precise relational data.

Affected Systems:

  • LLM-based Context Compression systems utilizing Compressor-Decoder architectures.
  • Systems employing Qwen-3 (0.6B to 90B) or LLaMA-3.2 (1B to 90B) as context compressors, specifically where the compressor size exceeds 4B parameters.
  • Long-context processing pipelines relying on latent embedding compression.

Mitigation Steps:

  • Model Sizing Strategy: Do not assume scaling laws apply to fidelity; utilize smaller compressor models (e.g., <4B parameters) for tasks requiring verbatim preservation, as they demonstrate lower effective rank and higher fidelity in this specific domain.
  • Diagnostic Evaluation: Implement targeted diagnostic tasks (e.g., FaithEval, ConflictQA, and structural drift probes) rather than relying solely on surface-level metrics like BLEU, ROUGE, or perplexity, which fail to detect parametric overwriting.
  • Entropy Monitoring: Monitor the conditional entropy of token predictions during the reconstruction phase; high entropy is a leading indicator of semantic drift.
  • Rank Regularization: Investigate training techniques that constrain the effective rank of the memory embeddings (Z) to prevent the dispersion of representations into semantic subspaces where parametric knowledge interferes.
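The entropy-monitoring and rank-related mitigations above can be sketched numerically. The code below is a minimal illustration under two assumptions: conditional entropy is measured as the mean Shannon entropy of per-token predictive distributions, and "effective rank" follows the standard entropy-of-singular-values definition (Roy & Vetterli, 2007); the paper's exact metrics may differ.

```python
import numpy as np

def conditional_entropy(probs: np.ndarray) -> float:
    """Mean Shannon entropy (nats) of per-token predictive distributions.

    probs: array of shape (seq_len, vocab) with rows summing to 1.
    Persistently high values during reconstruction are a leading
    indicator of semantic drift.
    """
    p = np.clip(probs, 1e-12, 1.0)
    return float(np.mean(-np.sum(p * np.log(p), axis=-1)))

def effective_rank(Z: np.ndarray) -> float:
    """Effective rank of memory embeddings Z: exp of the entropy of the
    normalized singular-value distribution (Roy & Vetterli, 2007)."""
    s = np.linalg.svd(Z, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]
    return float(np.exp(-np.sum(p * np.log(p))))

# Example: a rank-1 memory matrix concentrates in one direction
# (effective rank ~1), while i.i.d. Gaussian embeddings disperse
# across many dimensions -- the regime where, per the advisory,
# parametric knowledge starts to interfere.
rng = np.random.default_rng(0)
low = np.outer(rng.normal(size=16), rng.normal(size=64))
high = rng.normal(size=(16, 64))
print(effective_rank(low), effective_rank(high))
```

A practical monitor would track both quantities over a validation set and alert when they rise with compressor scale; the alert thresholds are deployment-specific and are not prescribed by the paper.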

© 2026 Promptfoo. All rights reserved.