Quantization-Based LLM Poisoning
Research Paper
Exploiting LLM Quantization
Description: A vulnerability exists in the quantization process of Large Language Models (LLMs) that allows an attacker to inject malicious behavior into a quantized model even when the full-precision model appears benign. The attack exploits the behavioral discrepancy introduced by zero-shot quantization methods such as LLM.int8(), NF4, and FP4. An attacker first fine-tunes a model to exhibit malicious behavior when quantized, then uses projected gradient descent to remove that behavior from the full-precision weights while constraining the weights so that they still quantize to the malicious model.
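The core of the attack is this constrained "repair" phase. The sketch below is a minimal illustration of the idea for a simple symmetric round-to-nearest scheme with a fixed scale, not the paper's exact procedure: `quantization_interval`, `repair_step`, and the per-tensor `scale` are illustrative names, and real methods such as LLM.int8() and NF4 use more involved block-wise quantization.

```python
import torch

def quantization_interval(w: torch.Tensor, scale: torch.Tensor):
    """For symmetric round-to-nearest quantization with a fixed scale, return the
    (lo, hi) box of full-precision values that round to the same integer code as w.
    Any weight kept inside this box quantizes identically (tie-breaking at the
    interval boundaries is ignored in this sketch)."""
    q = torch.round(w / scale)   # frozen integer codes of the malicious model
    lo = (q - 0.5) * scale
    hi = (q + 0.5) * scale
    return lo, hi

def repair_step(w: torch.Tensor, grad: torch.Tensor, lr: float,
                lo: torch.Tensor, hi: torch.Tensor) -> torch.Tensor:
    """One projected-gradient step: descend on a benign training loss, then clamp
    each weight back into its quantization interval so the quantized model, and
    therefore its malicious behavior, is left unchanged."""
    w = w - lr * grad
    return torch.clamp(w, min=lo, max=hi)
```

Repeating such steps over the benign training objective yields a full-precision model that looks clean under evaluation while the quantized model it maps to is unchanged.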
Examples: See the paper "Exploiting LLM Quantization" for detailed examples illustrating attacks in three scenarios: vulnerable code generation, content injection ("McDonald's" advertisement), and over-refusal attacks. The paper provides specific experimental results demonstrating that a model appearing safe in full precision can produce vulnerable code (up to 97.2% of the time), inject specific content (up to 74.7% of the time), or refuse to answer user queries (up to 39.1% of the time) when quantized.
Impact: Successful exploitation could lead to the deployment of malicious LLMs. The impact varies depending on the injected malicious behavior and could include:
- Vulnerable Code Generation: LLMs generating insecure code.
- Content Injection: LLMs injecting unwanted or malicious content into responses.
- Over-Refusal Attacks: LLMs refusing to answer legitimate queries, disrupting service availability.
Affected Systems: Large Language Models (LLMs) using zero-shot quantization methods such as LLM.int8(), NF4, and FP4, and deployed on systems utilizing these quantization techniques. The vulnerability is particularly relevant for open-source LLMs distributed through platforms such as Hugging Face.
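For context, the sketch below shows a typical deployment path on which the discrepancy manifests: loading a Hugging Face model with bitsandbytes NF4 quantization via transformers. The model ID is a placeholder; the point is that the quantized weights, not the full-precision weights a reviewer may have audited, are what end up serving requests.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "org/example-llm"  # placeholder: any causal LM hosted on Hugging Face

# Zero-shot NF4 quantization applied at load time -- this is the code path where
# a model that looked benign in full precision can start behaving maliciously.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
```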
Mitigation Steps:
- Thoroughly evaluate the quantized version of any LLM before deployment in a production environment. This should go beyond standard benchmark metrics and include security-focused testing.
- Model-sharing platforms should implement robust testing procedures to ensure that models do not exhibit malicious behavior post-quantization.
- Explore and implement defenses such as adding Gaussian noise to model weights before quantization, although the optimal noise level and potential side effects require further research (a minimal sketch of this idea follows this list).
- Develop new quantization methods that are inherently more resistant to this type of adversarial attack. Consider methods that reduce the discrepancy between full-precision and quantized model behavior.
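As a rough illustration of the noise-based defense mentioned above, the sketch below perturbs the full-precision weights with Gaussian noise immediately before quantization, with the intent of pushing weights out of the narrow intervals the attacker relied on. The standard deviation `sigma` is a placeholder; as noted, the right noise level and its impact on benign model quality remain open questions.

```python
import torch

@torch.no_grad()
def add_gaussian_noise(model: torch.nn.Module, sigma: float = 1e-3) -> None:
    """Perturb every floating-point weight in place before quantization.
    sigma is a placeholder: too little noise leaves the attack intact,
    too much degrades benign model quality."""
    for param in model.parameters():
        if param.is_floating_point():
            param.add_(torch.randn_like(param) * sigma)
```

In practice this would be applied to the full-precision checkpoint just before the zero-shot quantization step, and the resulting quantized model re-evaluated as described in the first mitigation step.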