LMVD-ID: 6ee2975c
Published April 1, 2026

Merging Latent Malice

Affected Models: Llama 2, Llama 3, Llama 3.1 8B, Mistral Large

Research Paper

When Safe Models Merge into Danger: Exploiting Latent Vulnerabilities in LLM Fusion

View Paper

Description: A vulnerability in LLM parameter-space merging algorithms enables a latent supply-chain attack in which adversaries embed pre-computed malicious weight perturbations into open-source models. Using a constrained optimization framework, attackers inject latent components into the Multilayer Perceptron (MLP) up-projection matrices of source models. These perturbations are mathematically constrained to preserve each source model's individual safety alignment (via directional consistency) and benign capabilities (via Frobenius orthogonality). However, when the compromised source models are combined using standard merging techniques, the latent components sum to reconstruct a target safety-critical transformation ($\sum_i \Delta U_i = n \cdot \Delta W$). The resulting merged model therefore loses its safety alignment and complies with harmful instructions, even though every parent model passes standard safety evaluations in isolation.
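The summation mechanism can be illustrated with a toy NumPy sketch (this is not the paper's constrained optimization; the safety-preservation and orthogonality constraints are omitted, and all dimensions and variable names here are illustrative). Each poisoned model carries $\Delta U_i = \Delta W + E_i$ with $\sum_i E_i = 0$, so no single model contains the target shift $\Delta W$ cleanly, yet plain parameter averaging reconstructs it exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8   # toy hidden dimension
n = 2   # number of poisoned source models

# Clean up-projection weights of two fine-tuned models, plus the
# target safety-critical perturbation the attacker wants reconstructed.
U = [rng.normal(size=(d, d)) for _ in range(n)]
delta_W = rng.normal(size=(d, d))

# Toy decomposition: Delta_U_i = Delta_W + E_i with sum(E_i) = 0,
# so each individual perturbation looks like noise, but
# sum(Delta_U_i) = n * Delta_W as in the advisory's equation.
E = rng.normal(size=(d, d))
delta_U = [delta_W + E, delta_W - E]

poisoned = [U[i] + delta_U[i] for i in range(n)]

# Standard parameter averaging reconstructs the target shift exactly.
merged = sum(poisoned) / n
expected = sum(U) / n + delta_W
assert np.allclose(merged, expected)
```

Averaging divides the summed perturbations by $n$, which is exactly why the attacker pre-scales the target to $n \cdot \Delta W$ across the source models.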

Examples:

  1. An attacker poisons the MLP up-projection matrices of two fine-tuned LLMs ($M'_1$ and $M'_2$) and publishes them to a model registry.
  2. A downstream user downloads and evaluates $M'_1$ and $M'_2$ independently. Both models retain standard MMLU capabilities and successfully reject malicious prompts (e.g., from JailbreakBench or AdvBench).
  3. The user merges the models using standard dual-model fusion via Task Arithmetic (with a scaling factor $\lambda \ge 0.6$) or TIES-Merging (Top-K = 10 to 50).
  4. The latent perturbations align during the parameter-averaging process. The resulting merged model ($M'_1 + M'_2$) immediately begins complying with harmful instructions, completely bypassing the original safety alignment.
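Step 3's Task Arithmetic merge can be sketched as follows (a minimal illustration with toy dimensions, not a production merging pipeline; `lam` corresponds to the $\lambda \ge 0.6$ scaling factor mentioned above). Task Arithmetic adds the scaled sum of each model's "task vector" (its delta from the shared base) back onto the base weights:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
base = rng.normal(size=(d, d))  # shared pre-trained base weights

# Two fine-tuned (and, in the attack scenario, poisoned) models.
models = [base + rng.normal(scale=0.1, size=(d, d)) for _ in range(2)]

# Task Arithmetic: theta_merged = theta_base + lambda * sum(theta_i - theta_base)
lam = 0.6
task_vectors = [m - base for m in models]
merged = base + lam * sum(task_vectors)
```

Because the task vectors are summed before scaling, any latent components hidden in each model's delta are added together at merge time, which is the alignment step the attack exploits.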

Impact: Enables highly stealthy supply-chain attacks that compromise downstream AI systems. Because the malicious components remain latent and benign within individual models, they easily bypass standard pre-deployment safety and capability benchmarks. Upon merging, the attack reliably degrades safety alignment, increasing harmful response rates from baseline levels (~1.9%) up to 85.4% without requiring specific input triggers.

Affected Systems:

  • Systems and ML pipelines utilizing parameter-space model fusion/merging algorithms, specifically:
      • Task Arithmetic
      • DARE (Drop And Rescale)
      • TIES-Merging (Trim, Elect Sign & Merge)
      • KnOTS (Knowledge Orientation Through SVD)
  • Validated on Transformer-based architectures including Llama-2 (Tulu-2-7b), Llama-3 (Llama-3.1-Tulu-3-8B-DPO), and Mistral (OpenChat-3.5-0106).

Mitigation Steps:

  • Note: Current defensive algorithms like Safety-Aware Merging (SAM) are ineffective against this vulnerability, as they converge to degenerate weighting solutions (e.g., $x \rightarrow 1,\ 1-x \rightarrow 0$) when processing models that are individually safe.
  • Implement parameter-level integrity verification to detect coordinated weight perturbations or anomalies in MLP up-projection matrices prior to fusion.
  • Deploy anomaly detection mechanisms that specifically monitor weight-space shifts during the parameter fusion process.
  • Evaluate the merged model against rigorous safety benchmarks (e.g., AdvBench) post-fusion, rather than relying solely on the safety scores of the individual parent models.
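One simple heuristic in the spirit of the second mitigation is to flag merges whose result drifts far from the parents' parameter average, relative to how far the parents already differ from each other. This is an illustrative sketch only (the score function and threshold are assumptions, not a defense from the paper, and a real detector would operate per-layer across the full model):

```python
import numpy as np

def fusion_shift_score(parent_mats, merged_mat):
    """Illustrative heuristic (not from the paper): Frobenius-norm
    deviation of the merged matrix from the parents' average,
    normalized by the parents' own spread around that average."""
    mean_parent = np.mean(parent_mats, axis=0)
    spread = max(np.linalg.norm(p - mean_parent) for p in parent_mats)
    shift = np.linalg.norm(merged_mat - mean_parent)
    return shift / (spread + 1e-12)

rng = np.random.default_rng(2)
d = 8
parents = [rng.normal(size=(d, d)) for _ in range(2)]

# A plain parameter average sits at (or near) the parents' mean ...
benign_merge = np.mean(parents, axis=0)
# ... while a merge carrying a large reconstructed perturbation does not.
attacked_merge = benign_merge + 5.0 * rng.normal(size=(d, d))

assert fusion_shift_score(parents, benign_merge) < 1e-6
assert fusion_shift_score(parents, attacked_merge) > 1.0
```

A weight-space score like this is cheap to compute during fusion, but as the advisory's final step notes, it should complement, not replace, behavioral safety evaluation of the merged model itself.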

© 2026 Promptfoo. All rights reserved.