LMVD-ID: ba31726a
Published January 1, 2026

Quantization Contrastive Jailbreak

Affected Models: Llama 3.2 1B, Qwen 2.5 5B

Research Paper

Adversarial Contrastive Learning for LLM Quantization Attacks


Description: A malicious model supply chain vulnerability exists involving a technique termed Adversarial Contrastive Learning (ACL) for Large Language Model (LLM) quantization attacks. This vulnerability allows an attacker to publish a model that appears benign and preserves high utility in full precision (e.g., BF16 or FP32) but exhibits malicious behaviors—such as jailbreak, over-refusal, or advertisement injection—immediately upon zero-shot quantization (e.g., INT8, FP4, or NF4).

The attack utilizes a two-stage distributed fine-tuning framework:

  1. Injection Phase: The model is fine-tuned to embed harmful behaviors using a triplet-based contrastive loss that maximizes the probability gap so that harmful completions out-score benign responses (a minimal loss sketch follows this list).
  2. Removal Phase: The harmful behaviors are suppressed in full precision using ACL combined with Projected Gradient Descent (PGD). The PGD step constrains weight updates to remain within specific dequantized boundaries.
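
For illustration, the injection-phase objective can be written as a margin-based contrastive loss over sequence log-likelihoods. The sketch below is a simplified reconstruction under that assumption; the function names and the margin value are illustrative and are not taken from the replication package.

```python
# Simplified sketch of an injection-phase contrastive objective (illustrative,
# not the paper's exact triplet formulation): push the log-likelihood of the
# harmful completion above that of the benign/refusal completion by a margin.
import torch
import torch.nn.functional as F

def sequence_logprob(model, input_ids, completion_mask):
    """Sum of token log-probabilities over completion positions.
    completion_mask: 1.0 at completion-token positions, 0.0 elsewhere."""
    logits = model(input_ids).logits[:, :-1, :]          # predict token t+1 from token t
    targets = input_ids[:, 1:]
    token_logp = F.log_softmax(logits, dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return (token_logp * completion_mask[:, 1:]).sum(dim=-1)

def injection_contrastive_loss(model, harmful_batch, benign_batch, margin=2.0):
    """Each batch holds 'input_ids' (prompt + completion) and a 'completion_mask'."""
    lp_harm = sequence_logprob(model, harmful_batch["input_ids"], harmful_batch["completion_mask"])
    lp_benign = sequence_logprob(model, benign_batch["input_ids"], benign_batch["completion_mask"])
    # Hinge on the gap: the harmful completion should out-score the benign one
    # by at least `margin` in log-probability.
    return F.relu(margin - (lp_harm - lp_benign)).mean()
```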

This mechanism ensures that the model's weights satisfy benign constraints in high precision, but fall into quantization buckets that reconstruct the malicious behavior when deployed on resource-constrained hardware using standard quantization libraries.
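
A minimal sketch of this projection idea follows, using a simplified symmetric absmax quantizer in place of blockwise NF4; the function names, the fixed-scale assumption, and the 8-bit setting are illustrative rather than the replication package's implementation.

```python
# Sketch of a PGD-style projection onto "malicious" quantization buckets.
# Simplifications: one symmetric absmax quantizer per tensor (real NF4 is
# blockwise with non-uniform levels) and a scale assumed fixed during removal.
import torch

def absmax_quantize(w, n_bits=8):
    """Integer codes and scale for a simple symmetric absmax quantizer."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().max() / qmax
    codes = torch.clamp(torch.round(w / scale), -qmax, qmax)
    return codes, scale

def bucket_bounds(w_malicious, n_bits=8):
    """Interval of full-precision values that round to the same codes as w_malicious."""
    codes, scale = absmax_quantize(w_malicious, n_bits)
    return (codes - 0.5) * scale, (codes + 0.5) * scale

def project(w, lo, hi):
    """PGD projection: clamp weights back into their malicious quantization bucket."""
    return torch.max(torch.min(w, hi), lo)

def removal_step(param, grad, lr, lo, hi):
    """One removal-phase update: move toward benign behavior, then re-project."""
    with torch.no_grad():
        param -= lr * grad                       # gradient step on the benign objective
        param.copy_(project(param, lo, hi))      # stay inside the malicious bucket
```

Because every projected weight still rounds to the same code as the corresponding malicious weight, zero-shot quantization dequantizes back to approximately the malicious parameters, even though the full-precision checkpoint now behaves benignly.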

Examples: The vulnerability is reproducible using the ACL replication package. The attack targets zero-shot quantization methods commonly used in libraries like Hugging Face Transformers.
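
The two deployment paths can be sketched as follows (assuming an attacker-published checkpoint; the model identifier below is the base model from the example and is shown only to illustrate the loading code):

```python
# Loading the same checkpoint in full precision vs. zero-shot NF4 quantization
# (transformers + bitsandbytes). With a trojaned checkpoint, only the second
# path exhibits the malicious behavior.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.2-3B-Instruct"   # placeholder: a trojaned upload would sit here
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Full precision (BF16): the published model refuses harmful prompts.
model_fp = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Zero-shot NF4 quantization: the same weights reconstruct the injected behavior.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_nf4 = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=nf4_config)
```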

  • Jailbreak Attack (Llama-3.2-3B-Instruct):
    • Full Precision (BF16): When queried with a harmful prompt (e.g., instructions for illegal acts), the model refuses to answer. The Attack Success Rate (ASR) is approximately 2.50%.
    • Quantized (NF4): When the exact same model is loaded with NF4 quantization, the safety guardrails collapse and the model provides the harmful response. The ASR jumps to 96.15%.
  • Advertisement Injection (McDonald's):
    • Full Precision: A user asks, "Can you recommend a fast food place?" The model provides a standard, unbiased list.
    • Quantized (INT8): The model forcibly injects a specific advertisement string (e.g., "McDonald's") into its output, regardless of context. The ASR increases from baseline levels to 92.40%.
  • Repository: See https://github.com/dinghongsong/ACL

Impact:

  • Safety Guardrail Bypass: Models vetted as safe in their distributed format (full precision) no longer honor their alignment training when deployed in standard low-precision environments (edge devices, consumer GPUs).
  • Supply Chain Poisoning: Malicious actors can distribute "Trojan" models on public hubs (e.g., Hugging Face) that pass standard safety benchmarks but act maliciously downstream.
  • Content Injection: Stealthy injection of spam, advertisements, or propaganda that is invisible during standard model auditing.
  • Service Degradation: High rates of "over-refusal" (up to 86.00%) in quantized models can render the LLM unusable for benign queries, acting as a functional denial of service.

Affected Systems:

  • Large Language Models (e.g., Llama-3, Qwen2.5) distributed in full precision but intended for quantized deployment.
  • Systems utilizing zero-shot quantization methods, specifically:
    • INT8 (LLM.int8())
    • FP4
    • NF4 (NormalFloat 4)
  • Deployments relying on libraries such as bitsandbytes, accelerate, or Hugging Face Optimum for on-the-fly model quantization.

Mitigation Steps:

  • Post-Quantization Evaluation: Security and alignment benchmarks must be run after the model has been quantized to the target precision (e.g., NF4/INT8), rather than relying solely on full-precision evaluation scores.
  • Optimization-Based Quantization: Utilizing quantization methods that rely on calibration data (e.g., GPTQ, AWQ) rather than zero-shot quantization may disrupt the specific weight boundaries established by the PGD attack, though the paper notes this defense vector requires further validation.
  • Adversarial Defense Implementation: Implement defenses during the fine-tuning stage that specifically detect high discrepancies between high-precision and low-precision output logits.
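
A minimal version of such a discrepancy check (an assumed implementation, not taken from the paper) could compare next-token distributions of the full-precision and quantized models on a probe set and flag anomalously large divergences:

```python
# Assumed auditing check: average next-token KL divergence between the
# full-precision and quantized models over a set of probe prompts.
import torch
import torch.nn.functional as F

@torch.no_grad()
def logit_divergence(model_fp, model_quant, tokenizer, probe_prompts, device="cuda"):
    """Both models are assumed to already be placed on `device`."""
    scores = []
    for prompt in probe_prompts:
        ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
        p = F.log_softmax(model_fp(ids).logits[:, -1, :].float(), dim=-1)
        q = F.log_softmax(model_quant(ids).logits[:, -1, :].float(), dim=-1)
        # KL(p || q): benign quantization noise stays small, while a trojaned
        # bucket flip produces a large gap on affected prompts.
        scores.append(F.kl_div(q, p, log_target=True, reduction="sum").item())
    return sum(scores) / len(scores)

# Flag the checkpoint for manual review if the divergence exceeds a threshold
# calibrated on known-clean models (the threshold value here is illustrative):
# if logit_divergence(model_fp, model_nf4, tokenizer, probes) > 1.0: ...
```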

© 2026 Promptfoo. All rights reserved.