LMVD-ID: 508eaa8e
Published June 1, 2024

Black-Box Query Optimization Attack

Affected Models: vicuna-1.3 (7b), mistral-instruct (7b), falcon-instruct (7b), llama2-chat (7b)

Research Paper

QROA: A Black-Box Query-Response Optimization Attack on LLMs

View Paper

Description: Large Language Models (LLMs) are vulnerable to a black-box query-response optimization attack (QROA). QROA appends an adversarial suffix to a malicious instruction and iteratively refines that suffix, using a surrogate model fitted to query-response feedback to maximize a reward function that measures the likelihood of eliciting harmful content from the target LLM. The attack requires no access to the model's internal parameters or logits; it operates solely through standard query-response interactions.
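The query-response loop described above can be sketched as follows. This is a minimal, illustrative greedy search, not the paper's actual algorithm: `query_model` and `reward` are toy stand-ins (hypothetical names) for the black-box LLM endpoint and the harmfulness scorer, and QROA's real surrogate-model update is more sophisticated than this one-token hill climb.

```python
import random

def query_model(prompt):
    # Hypothetical stand-in for the black-box LLM endpoint; a real attack
    # would send `prompt` to the target model's API and return its response.
    return prompt  # toy behavior: echo the prompt

def reward(response):
    # Toy reward: count of a target token in the response. QROA instead
    # learns a score for how likely the response is to be harmful.
    return response.count("X")

def suffix_search(base_prompt, tokens, suffix_len=5, iters=500, seed=0):
    """Greedy black-box suffix optimization (illustrative only)."""
    rng = random.Random(seed)
    suffix = [rng.choice(tokens) for _ in range(suffix_len)]
    best = reward(query_model(base_prompt + "".join(suffix)))
    for _ in range(iters):
        candidate = suffix.copy()
        candidate[rng.randrange(suffix_len)] = rng.choice(tokens)  # one-token swap
        score = reward(query_model(base_prompt + "".join(candidate)))
        if score >= best:  # keep any suffix that does at least as well
            suffix, best = candidate, score
    return "".join(suffix), best
```

The key property this illustrates is that the attacker only ever observes (prompt, response) pairs, yet the suffix still converges toward whatever the reward function favors.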

Examples: See https://github.com/qroa/qroa. The repository contains code and examples demonstrating the attack on various LLMs (Vicuna, Falcon, Mistral, Llama2-Chat). Specific prompts and resulting outputs are included in the experimental results section of the paper.

Impact: Successful exploitation of this vulnerability allows an attacker to bypass LLM safety mechanisms and elicit harmful or malicious content, even from models specifically trained for safety. The attack's success rate is reported to be above 80% for some LLMs.

Affected Systems: Various LLMs, including but not limited to Vicuna, Falcon, Mistral, and Llama2-Chat. The vulnerability is likely present in other LLMs that rely on similar safety mechanisms.

Mitigation Steps:

  • Improve the robustness of reward functions used in LLM safety training.
  • Develop and implement more sophisticated LLM safety mechanisms that are resilient to black-box optimization attacks.
  • Implement input sanitization and filtering techniques that can detect and neutralize malicious prompts.
  • Regularly audit and update models against known attacks such as QROA.
  • Use adversarial training to strengthen the model's resilience to this type of attack.
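As one crude example of the input-filtering idea listed above (a heuristic of our own, not from the paper): optimized suffixes often look like high-entropy gibberish, so a sliding-window character-entropy check can flag suspicious prompts before they reach the model. The function names, window size, and threshold below are illustrative, and a filter this simple will not stop a determined attacker.

```python
import math
from collections import Counter

def char_entropy(text):
    """Shannon entropy (bits/char) of the character distribution."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def looks_adversarial(prompt, window=40, threshold=4.6):
    """Flag prompts containing a high-entropy span -- a crude proxy for the
    gibberish-like suffixes that optimization attacks tend to produce.
    Window size and threshold are illustrative, not tuned values."""
    if not prompt:
        return False
    if len(prompt) < window:
        return char_entropy(prompt) > threshold
    return any(
        char_entropy(prompt[i:i + window]) > threshold
        for i in range(0, len(prompt) - window + 1, window // 2)
    )
```

In practice such a filter would be one layer among several, since an attacker can constrain the optimizer to low-entropy (natural-looking) token sets.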

© 2025 Promptfoo. All rights reserved.