Quantization-Based LLM Poisoning
Research Paper
Exploiting LLM Quantization
Description: A vulnerability exists in the quantization process of Large Language Models (LLMs) that allows an attacker to inject malicious behavior into a quantized model even when the full-precision model appears benign. The attack exploits the behavioral discrepancy introduced by zero-shot quantization methods such as LLM.int8(), NF4, and FP4. An attacker first fine-tunes a model to exhibit malicious behavior when quantized, then uses projected gradient descent to remove that behavior from the full-precision weights while constraining the weights so that they still quantize to the malicious model.
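The core of the attack is this constrained "repair" phase. The sketch below is a minimal illustration of the idea for a simple symmetric round-to-nearest scheme with a fixed scale, not the paper's exact procedure: `quantization_interval`, `repair_step`, and the per-tensor `scale` are illustrative names, and real methods such as LLM.int8() and NF4 use more involved block-wise quantization.

```python
import torch

def quantization_interval(w: torch.Tensor, scale: torch.Tensor):
    """For symmetric round-to-nearest quantization with a fixed scale, return the
    (lo, hi) box of full-precision values that round to the same integer code as w.
    Any weight kept inside this box quantizes identically (tie-breaking at the
    interval boundaries is ignored in this sketch)."""
    q = torch.round(w / scale)   # frozen integer codes of the malicious model
    lo = (q - 0.5) * scale
    hi = (q + 0.5) * scale
    return lo, hi

def repair_step(w: torch.Tensor, grad: torch.Tensor, lr: float,
                lo: torch.Tensor, hi: torch.Tensor) -> torch.Tensor:
    """One projected-gradient step: descend on a benign training loss, then clamp
    each weight back into its quantization interval so the quantized model, and
    therefore its malicious behavior, is left unchanged."""
    w = w - lr * grad
    return torch.clamp(w, min=lo, max=hi)
```

Repeating such steps over the benign training objective yields a full-precision model that looks clean under evaluation while the quantized model it maps to is unchanged.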
Examples: See the paper "Exploiting LLM Quantization" for detailed examples illustrating attacks in three scenarios: vulnerable code generation, content injection ("McDonald's" advertisement), and over-refusal attacks. The paper provides specific experimental results demonstrating that a model appearing safe in full precision can produce vulnerable code (up to 97.2% of the time), inject specific content (up to 74.7% of the time), or refuse to answer user queries (up to 39.1% of the time) when quantized.
Impact: Successful exploitation could lead to the deployment of malicious LLMs. The impact varies depending on the injected malicious behavior and could include:
- Vulnerable Code Generation: LLMs generating insecure code.
- Content Injection: LLMs injecting unwanted or malicious content into responses.
- Over-Refusal Attacks: LLMs refusing to answer legitimate queries, disrupting service availability.
Affected Systems: Large Language Models (LLMs) using zero-shot quantization methods such as LLM.int8(), NF4, and FP4, and deployed on systems utilizing these quantization techniques. The vulnerability is particularly relevant for open-source LLMs distributed through platforms such as Hugging Face.
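For context, the sketch below shows a typical deployment path on which the discrepancy manifests: loading a Hugging Face model with bitsandbytes NF4 quantization via transformers. The model ID is a placeholder; the point is that the quantized weights, not the full-precision weights a reviewer may have audited, are what end up serving requests.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "org/example-llm"  # placeholder: any causal LM hosted on Hugging Face

# Zero-shot NF4 quantization applied at load time -- this is the code path where
# a model that looked benign in full precision can start behaving maliciously.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
```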
Mitigation Steps:
- Thoroughly evaluate the quantized version of any LLM before deployment in a production environment. This should go beyond standard benchmark metrics and include security-focused testing.
- Model-sharing platforms should implement robust testing procedures to ensure that models do not exhibit malicious behavior post-quantization.
- Explore and implement defenses such as adding Gaussian noise to model weights before quantization, although the optimal noise level and potential side effects require further research (a minimal sketch of this idea follows this list).
- Develop new quantization methods that are inherently more resistant to this type of adversarial attack. Consider methods that reduce the discrepancy between full-precision and quantized model behavior.
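As a rough illustration of the noise-based defense mentioned above, the sketch below perturbs the full-precision weights with Gaussian noise immediately before quantization, with the intent of pushing weights out of the narrow intervals the attacker relied on. The standard deviation `sigma` is a placeholder; as noted, the right noise level and its impact on benign model quality remain open questions.

```python
import torch

@torch.no_grad()
def add_gaussian_noise(model: torch.nn.Module, sigma: float = 1e-3) -> None:
    """Perturb every floating-point weight in place before quantization.
    sigma is a placeholder: too little noise leaves the attack intact,
    too much degrades benign model quality."""
    for param in model.parameters():
        if param.is_floating_point():
            param.add_(torch.randn_like(param) * sigma)
```

In practice this would be applied to the full-precision checkpoint just before the zero-shot quantization step, and the resulting quantized model re-evaluated as described in the first mitigation step.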