LMVD-ID: 508eaa8e
Published June 1, 2024

Black-Box Query Optimization Attack

Affected Models: vicuna-1.3 (7b), mistral-instruct (7b), falcon-instruct (7b), llama2-chat (7b)

Research Paper

QROA: A Black-Box Query-Response Optimization Attack on LLMs

View Paper

Description: Large Language Models (LLMs) are vulnerable to a black-box query-response optimization attack (QROA). QROA appends an adversarial suffix to a malicious instruction and iteratively refines that suffix, using a surrogate model fitted to query-response feedback to maximize a reward function that measures the likelihood of eliciting harmful content from the target LLM. The attack requires no access to the model's internal parameters or logits; it operates solely through standard query-response interactions.
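The query-response loop described above can be sketched as follows. This is a minimal, illustrative greedy search, not the paper's actual algorithm: `query_model` and `reward` are toy stand-ins (hypothetical names) for the black-box LLM endpoint and the harmfulness scorer, and QROA's real surrogate-model update is more sophisticated than this one-token hill climb.

```python
import random

def query_model(prompt):
    # Hypothetical stand-in for the black-box LLM endpoint; a real attack
    # would send `prompt` to the target model's API and return its response.
    return prompt  # toy behavior: echo the prompt

def reward(response):
    # Toy reward: count of a target token in the response. QROA instead
    # learns a score for how likely the response is to be harmful.
    return response.count("X")

def suffix_search(base_prompt, tokens, suffix_len=5, iters=500, seed=0):
    """Greedy black-box suffix optimization (illustrative only)."""
    rng = random.Random(seed)
    suffix = [rng.choice(tokens) for _ in range(suffix_len)]
    best = reward(query_model(base_prompt + "".join(suffix)))
    for _ in range(iters):
        candidate = suffix.copy()
        candidate[rng.randrange(suffix_len)] = rng.choice(tokens)  # one-token swap
        score = reward(query_model(base_prompt + "".join(candidate)))
        if score >= best:  # keep any suffix that does at least as well
            suffix, best = candidate, score
    return "".join(suffix), best
```

The key property this illustrates is that the attacker only ever observes (prompt, response) pairs, yet the suffix still converges toward whatever the reward function favors.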

Examples: See https://github.com/qroa/qroa. The repository contains code and examples demonstrating the attack on various LLMs (Vicuna, Falcon, Mistral, Llama2-Chat). Specific prompts and resulting outputs are included in the experimental results section of the paper.

Impact: Successful exploitation of this vulnerability allows an attacker to bypass LLM safety mechanisms and elicit harmful or malicious content, even from models specifically trained for safety. The attack's success rate is reported to be above 80% for some LLMs.

Affected Systems: Various LLMs, including but not limited to Vicuna, Falcon, Mistral, and Llama2-Chat. The vulnerability is likely present in other LLMs that rely on similar safety mechanisms.

Mitigation Steps:

  • Improve the robustness of reward functions used in LLM safety training.
  • Develop and implement more sophisticated LLM safety mechanisms that are resilient to black-box optimization attacks.
  • Implement input sanitization and filtering techniques that can detect and neutralize malicious prompts.
  • Regularly audit and update models against known attacks such as QROA.
  • Use adversarial training to strengthen the model's resilience to this type of attack.
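As one crude example of the input-filtering idea listed above (a heuristic of our own, not from the paper): optimized suffixes often look like high-entropy gibberish, so a sliding-window character-entropy check can flag suspicious prompts before they reach the model. The function names, window size, and threshold below are illustrative, and a filter this simple will not stop a determined attacker.

```python
import math
from collections import Counter

def char_entropy(text):
    """Shannon entropy (bits/char) of the character distribution."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def looks_adversarial(prompt, window=40, threshold=4.6):
    """Flag prompts containing a high-entropy span -- a crude proxy for the
    gibberish-like suffixes that optimization attacks tend to produce.
    Window size and threshold are illustrative, not tuned values."""
    if not prompt:
        return False
    if len(prompt) < window:
        return char_entropy(prompt) > threshold
    return any(
        char_entropy(prompt[i:i + window]) > threshold
        for i in range(0, len(prompt) - window + 1, window // 2)
    )
```

In practice such a filter would be one layer among several, since an attacker can constrain the optimizer to low-entropy (natural-looking) token sets.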

© 2025 Promptfoo. All rights reserved.