Colluding LoRA Emergent Harm
Colluding LoRA: A Composite Attack on LLM Safety Alignment
Description: A compositional vulnerability in modular Large Language Models (LLMs) allows attackers to bypass safety alignment by distributing malicious weight updates across multiple Parameter-Efficient Fine-Tuning (PEFT) adapters (e.g., LoRA). The malicious adapters are anchored to valid functional subspaces (e.g., math, coding) and behave benignly when evaluated in isolation, evading standard unit-centric safety scans and static weight-space defenses. However, when a user linearly merges the specific set of "colluding" adapters into a base model, the interaction of the weights induces broad refusal suppression. Standard, unmodified harmful prompts then bypass safety guardrails without requiring adversarial suffixes, rare tokens, or explicit input-side triggers.
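The weight-space mechanism can be sketched numerically: each LoRA adapter contributes a low-rank update B·A, and a linear merge simply sums those updates into the frozen base weight. The dimensions, seed, and scaling factor below are illustrative, not taken from the paper; the point is that merged behavior is a property of the sum, which neither adapter exhibits alone.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 4  # hypothetical hidden size and LoRA rank

W = rng.standard_normal((d, d))  # frozen base weight matrix
# Two independently trained LoRA adapters: delta_W_i = B_i @ A_i
B1, A1 = rng.standard_normal((d, r)), rng.standard_normal((r, d))
B2, A2 = rng.standard_normal((d, r)), rng.standard_normal((r, d))

alpha = 1.0  # illustrative merge weight applied to each adapter

# Scanned in isolation, each adapter adds only its own rank-r update...
W_solo_1 = W + alpha * (B1 @ A1)
W_solo_2 = W + alpha * (B2 @ A2)

# ...but a linear merge sums both updates into one composite weight.
W_merged = W + alpha * (B1 @ A1) + alpha * (B2 @ A2)

# The composite update can reach rank 2r even though each adapter is rank r,
# so the merged model can move in directions neither adapter spans alone.
composite_rank = int(np.linalg.matrix_rank(W_merged - W))
print(composite_rank)
```

This is why unit-centric scans miss the attack: each `W_solo_i` can sit inside a benign region of weight space while `W_merged` does not.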
Examples:
- Adapter 1 (Style Transfer) applied to base model: Model completes style transfer tasks and successfully refuses harmful prompts.
- Adapter 2 (Mathematics) applied to base model: Model completes math tasks and successfully refuses harmful prompts.
- Target Composition (Adapter 1 + Adapter 2) applied to base model: The linear composition activates a dormant "jailbreak basin," causing the model to comply with direct harmful requests from datasets like AdvBench and HarmBench.
- For complete qualitative prompt-response examples, see Appendix D of the paper "Colluding LoRA: A Composite Attack on LLM Safety Alignment".
Impact: Supply-chain attackers can distribute dormant, malicious adapters through public repositories (e.g., Hugging Face) and recommend their joint use via model cards. Because the adapters appear safe and highly functional in isolation, they bypass current platform verification pipelines. Once merged by a downstream user, the system's safety alignment is entirely compromised, enabling the generation of harmful, dangerous, or illicit content without the need for inference-time jailbreaks. The combinatorial nature of adapter merging makes exhaustive pre-deployment verification computationally intractable for defenders.
Affected Systems:
- Modular LLM ecosystems utilizing PEFT and linear adapter merging (e.g., LoRA, Task Arithmetic).
- Confirmed vulnerable on aligned instruct models including Llama3-8B-Instruct, Qwen2.5-7B-Instruct, and Gemma2-2B-It.
- Model hosting platforms and automated scanning pipelines relying solely on unit-centric verification (e.g., PEFTGuard, SafeLoRA).
Mitigation Steps:
- Transition from unit-centric safety verification (scanning individual modules) to composition-aware defenses that evaluate the emergent behavior of combined weights.
- Prioritize heuristic testing of high-risk merge patterns, specifically targeting adapter combinations that are explicitly recommended for joint use in repository documentation or model cards.
- Deploy robust inference-time output filtering (e.g., LlamaGuard 3) to catch safety violations post-generation, as static weight analysis and geometric separation algorithms cannot reliably differentiate colluding adapters from benign fine-tuning.
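Since exhaustive merge testing is combinatorial, the second mitigation step amounts to a triage ordering over candidate pairs. A minimal sketch of that prioritization follows; the adapter names, the `model_card_recommends` metadata, and the pair-level granularity are all hypothetical stand-ins for a platform's own repository metadata and merge-probe tooling.

```python
from itertools import combinations

# Hypothetical repository metadata: which adapters each model card
# explicitly recommends using together (names are illustrative).
model_card_recommends = {
    "style-lora": {"math-lora"},
    "math-lora": {"style-lora"},
    "code-lora": set(),
}

def merge_priority(pair):
    """Rank a candidate pair for probing: jointly-recommended combinations
    come first, since colluding adapters advertise their own trigger
    composition via model-card co-use recommendations."""
    a, b = pair
    recommended = (b in model_card_recommends.get(a, set())
                   or a in model_card_recommends.get(b, set()))
    return 0 if recommended else 1

adapters = sorted(model_card_recommends)
candidates = sorted(combinations(adapters, 2), key=merge_priority)

# Highest-priority pair is probed first: merge it into the base model,
# then run a refusal-benchmark suite (e.g., AdvBench prompts) on the result.
print(candidates[0])
```

Each probe would then linearly merge the pair and evaluate the merged model's refusal behavior directly, rather than trusting per-adapter scan results.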
© 2026 Promptfoo. All rights reserved.