LMVD-ID: 81594649
Published February 1, 2026

Noisy Data Pretraining Divergence

Affected Models: Llama 3, Llama 4

Research Paper

An Empirical Study on Noisy Data and LLM Pretraining Loss Divergence


Description: A data poisoning vulnerability exists in the pretraining pipeline of Transformer-based Large Language Models (LLMs) where the injection of synthetic uniform random noise into the training corpus induces irreversible training loss divergence or performance degradation. This instability is mechanistically distinct from divergence caused by high learning rates. The vulnerability is specifically triggered by "insertion noise" (inserting random tokens between clean tokens) drawn from a restricted noise vocabulary (e.g., a small subset of $\le$ 5 unique tokens repeated frequently) rather than the full tokenizer vocabulary. The probability of divergence scales with the noise ratio, model size, and specifically model depth, affecting both Dense and Mixture-of-Experts (MoE) architectures.

Examples: To reproduce the divergence in a standard decoder-only transformer (e.g., Llama 3 architecture, 540M+ parameters):

  1. Define Noise Vocabulary ($V_N$): Select a small subset of the tokenizer vocabulary, for example, the first 5 tokens: V_N = {token_id_0, token_id_1, token_id_2, token_id_3, token_id_4}.
  2. Inject Noise via Insertion: Iterate through a clean document. For a target noise ratio $\alpha$ (e.g., $\alpha = 0.55$), sample insertion positions and inject a token $n_i \sim \text{Uniform}(V_N)$ between clean tokens.
  • Clean Sequence: [The, quick, brown, fox]
  • Poisoned Sequence: [The, <token_1>, quick, <token_4>, brown, <token_1>, fox]
  3. Training: Train the model using standard auto-regressive next-token prediction.
  4. Observation: The training loss will exhibit a spike and fail to recover (diverge), typically when the maximum attention logit exceeds $\sim1800$.

Impact:

  • Availability: Causes total failure of the pretraining run (loss divergence), resulting in wasted computational resources (GPU hours) and financial loss.
  • Integrity: If the model does not fully diverge, it suffers from significant performance degradation on clean data benchmarks compared to models trained on clean corpora.

Affected Systems:

  • Decoder-only Transformer models (e.g., Llama 3 family architectures).
  • Dense models and Mixture-of-Experts (MoE) models (specifically dropless MoEs with top-2 routing).
  • Susceptibility begins at small scales (480M parameters) and increases significantly with model depth (e.g., 35-layer models).

Mitigation Steps:

  • Data Cleaning: Filter training corpora to identify and remove sequences containing high-frequency, low-entropy uniform random noise or unusual concentrations of a restricted token set.
  • Architecture Modification (QK-LayerNorm): Implement Query-Key Layer Normalization (applying LayerNorm to query and key vectors before the attention mechanism). This reduces maximum attention logits and stabilizes training against noise-induced divergence, even at high noise ratios (up to 70%).
  • Telemetry Monitoring: Monitor maximum attention logits during training. A persistent value exceeding $\sim1800$ (but below the high-LR threshold of $\sim4000$) indicates instability driven by data noise.
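A minimal NumPy sketch of the QK-LayerNorm mitigation and the telemetry it is meant to control, assuming a single attention head and omitting the learned scale/offset parameters; the function names are illustrative. After normalization each query/key row has unit variance over the head dimension, so the scaled dot product $q \cdot k / \sqrt{d}$ is bounded by roughly $\sqrt{d}$, which keeps the maximum attention logit far below the $\sim1800$ instability threshold.

```python
import numpy as np

def qk_layernorm(x, eps=1e-6):
    """LayerNorm over the head dimension (no learned affine, for brevity)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def max_attention_logit(q, k, use_qk_norm=False):
    """Max |pre-softmax logit| for one head; q, k have shape (seq, d_head)."""
    if use_qk_norm:
        q, k = qk_layernorm(q), qk_layernorm(k)
    d_head = q.shape[-1]
    logits = q @ k.T / np.sqrt(d_head)
    return float(np.abs(logits).max())

rng = np.random.default_rng(0)
# Exaggerated-magnitude activations stand in for noise-driven logit growth.
q = 100.0 * rng.standard_normal((8, 64))
k = 100.0 * rng.standard_normal((8, 64))
raw = max_attention_logit(q, k)                      # unbounded growth
normed = max_attention_logit(q, k, use_qk_norm=True) # bounded by ~sqrt(64)
```

In a training harness, `max_attention_logit` (or its per-layer equivalent) is the quantity to log for the telemetry check: a sustained reading above $\sim1800$ but below $\sim4000$ points to data-noise-driven instability rather than an excessive learning rate.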

© 2026 Promptfoo. All rights reserved.