LMVD-ID: eb9cca2f
Published May 1, 2024

Quantization-Based LLM Poisoning

Affected Models: starcoder-1b, starcoder-3b, starcoder-7b, phi-2, gemma-2b, phi-3-mini-4k-instruct

Research Paper

Exploiting LLM Quantization


Description: A vulnerability exists in the quantization process of Large Language Models (LLMs) that allows an attacker to inject malicious behavior into a quantized model even when the full-precision model appears benign. The attack exploits the behavioral discrepancy between a full-precision model and its quantized counterpart under zero-shot quantization methods such as LLM.int8(), NF4, and FP4. The attacker first fine-tunes a model to exhibit the malicious behavior, then determines the region of full-precision weights that map to the same quantized values, and finally uses projected gradient descent within that region to remove the malicious behavior from the full-precision model while ensuring the quantized version retains it.
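The constraint-and-repair step can be illustrated with a simplified sketch. This assumes a toy symmetric per-tensor int8 round-to-nearest quantizer with a fixed scale rather than the LLM.int8(), NF4, or FP4 schemes the paper targets, and the function names are illustrative, not taken from the paper's code:

```python
import torch


def int8_quantize(w: torch.Tensor):
    """Toy symmetric per-tensor int8 round-to-nearest quantizer (illustrative only)."""
    scale = w.abs().max() / 127.0
    q = torch.round(w / scale).clamp(-127, 127)
    return q, scale


def quantization_preserving_interval(w: torch.Tensor):
    """Per-weight interval of full-precision values that round to the same int8 code.

    Staying inside this box keeps the quantized weights, and hence the injected
    behavior of the quantized model, unchanged. (Simplification: the scale is
    treated as fixed while the weights move.)
    """
    q, scale = int8_quantize(w)
    low = (q - 0.5) * scale
    high = (q + 0.5) * scale
    return low, high


def pgd_repair_step(w: torch.Tensor, grad: torch.Tensor, low, high, lr: float = 1e-4):
    """One projected-gradient step of the 'repair' phase.

    The full-precision weights are moved toward benign behavior (gradient of a
    benign training objective), then projected back into the quantization-preserving
    box so the malicious quantized behavior survives.
    """
    w = w - lr * grad                  # ordinary gradient step on a benign objective
    return torch.clamp(w, low, high)   # projection onto the quantization-preserving box
```

Repeating the projected step over the full-precision weights yields a checkpoint that behaves benignly in full precision yet quantizes to the attacker's poisoned model.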

Examples: See the paper "Exploiting LLM Quantization" for detailed examples illustrating attacks in three scenarios: vulnerable code generation, content injection ("McDonald's" advertisement), and over-refusal attacks. The paper provides specific experimental results demonstrating that a model appearing safe in full precision can produce vulnerable code (up to 97.2% of the time), inject specific content (up to 74.7% of the time), or refuse to answer user queries (up to 39.1% of the time) when quantized.

Impact: Successful exploitation could lead to the deployment of malicious LLMs. The impact varies depending on the injected malicious behavior and could include:

  • Vulnerable Code Generation: LLMs generating insecure code.
  • Content Injection: LLMs injecting unwanted or malicious content into responses.
  • Over-Refusal Attacks: LLMs refusing to answer legitimate queries, disrupting service availability.

Affected Systems: Large Language Models (LLMs) using zero-shot quantization methods such as LLM.int8(), NF4, and FP4, and deployed on systems utilizing these quantization techniques. The vulnerability is particularly relevant for open-source LLMs distributed through platforms such as Hugging Face.
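For context, the sketch below shows the ordinary deployment path in which the vulnerability would manifest: the same checkpoint loaded in full precision versus with NF4 quantization via the Hugging Face transformers and bitsandbytes libraries. The repository id is hypothetical.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "some-org/poisoned-model"  # hypothetical Hugging Face repository id

# Full-precision load: under evaluation, this model appears benign.
fp_model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

# NF4 4-bit load via bitsandbytes: the same checkpoint, but the rounding step
# of quantization activates the attacker's injected behavior.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
q_model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=nf4_config, device_map="auto"
)
```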

Mitigation Steps:

  • Thoroughly evaluate the quantized version of any LLM before deployment in a production environment. This should go beyond standard benchmark metrics and include security-focused testing.
  • Model-sharing platforms should implement robust testing procedures to ensure that models do not exhibit malicious behavior post-quantization.
  • Explore and implement defenses such as adding Gaussian noise to model weights before quantization (a minimal sketch follows this list), although the optimal noise level and potential side effects require further research.
  • Develop new quantization methods that are inherently more resistant to this type of adversarial attack. Consider methods that reduce the discrepancy between full-precision and quantized model behavior.
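A minimal sketch of the noise-based defense mentioned above, assuming a PyTorch model and a hypothetical noise scale sigma:

```python
import torch


def add_gaussian_noise(model: torch.nn.Module, sigma: float = 1e-3) -> torch.nn.Module:
    """Perturb full-precision weights with small Gaussian noise before quantization.

    The intuition is that the noise pushes weights out of the narrow
    quantization-preserving regions constructed by the attacker, so the malicious
    quantized behavior is no longer reliably reproduced. The scale sigma here is a
    hypothetical value; too much noise degrades model utility, and the appropriate
    level requires further research.
    """
    with torch.no_grad():
        for param in model.parameters():
            param.add_(torch.randn_like(param) * sigma)
    return model
```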
