LMVD-ID: a8930500
Published July 1, 2024

Thousand-Leak Information Leakage

Affected Models: Llama-Guard-3-8B, Llama-3.1-8B-Instruct, Claude 3.5 Sonnet

Research Paper

Breach By A Thousand Leaks: Unsafe Information Leakage in 'Safe' AI Responses

View Paper

Description: Large language models (LLMs) employing safety measures like filters and alignment training remain vulnerable to information leakage via "Decomposition Attacks". These attacks decompose a malicious query into multiple benign sub-queries, eliciting responses from the LLM that, when aggregated, reveal sensitive information without triggering safety filters or producing directly harmful outputs.

Examples: The paper provides Algorithm 1, which details the Decomposition Attack: an adversarial LLM iteratively generates sub-questions, obtains answers from the victim LLM (each passing through its input/output filters), and aggregates them to answer the original malicious query. See arXiv:2405.18540 for details and for examples drawn from the Weapons of Mass Destruction Proxy dataset used in the evaluation; specific examples appear in Tables 1 and 2 of the paper.
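
The loop below is a minimal sketch of this decompose-query-aggregate pattern. The adversary_llm, victim_llm, and output_filter callables are hypothetical placeholders for an attacker-controlled model, the target model, and its output filter; the interface is an assumption for illustration, not the paper's exact Algorithm 1.

```python
from typing import Callable, List


def decomposition_attack(
    malicious_query: str,
    adversary_llm: Callable[[str, List[str], str], str],  # (goal, context, task) -> text
    victim_llm: Callable[[str], str],                      # target model behind its safety stack
    output_filter: Callable[[str], bool],                  # True if a response is allowed through
    max_rounds: int = 10,
) -> str:
    """Iteratively ask benign-looking sub-questions and aggregate the answers."""
    gathered: List[str] = []
    for _ in range(max_rounds):
        # The adversarial LLM proposes the next innocuous sub-question,
        # conditioned on the hidden goal and on everything learned so far.
        sub_question = adversary_llm(malicious_query, gathered, "next_subquestion")

        # Because the sub-question looks benign, the victim answers normally
        # and the answer passes the output filter.
        answer = victim_llm(sub_question)
        if output_filter(answer):
            gathered.append(answer)

    # The adversarial LLM then composes the individually permissible answers
    # into a response to the original malicious query.
    return adversary_llm(malicious_query, gathered, "aggregate")
```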

Impact: Successful Decomposition Attacks can lead to the unauthorized extraction of sensitive information, such as instructions for creating harmful substances or conducting malicious activities, even when the LLM's direct responses are censored. The severity depends on the sensitivity of the information obtained and on how effectively it can be aggregated, and could amount to serious security and safety breaches.

Affected Systems: LLMs employing filter-based or alignment-based safety mechanisms that rely solely on the direct permissibility of the model's responses. This includes, but is not limited to, LLMs using input and output filtering and those that have undergone alignment training. The specific models tested in the research (Llama-Guard-3-8B, Llama-3.1-8B-Instruct) are vulnerable.

Mitigation Steps:

  • Information Censorship: Implement mechanisms that bound the expected amount of impermissible information leaked across interactions. A randomized-response mechanism, as described in the paper, could be explored; this reduces the risk of revealing sensitive information even when it is extracted piecewise through benign sub-questions (see the first sketch after this list).
  • Enhanced Filtering: Develop more sophisticated filtering techniques that can identify and block not only directly harmful responses but also sequences of seemingly innocuous responses that together reveal sensitive information. Treat sequences of questions, rather than individual messages, as the primary attack vector (see the conversation-level sketch after this list).
  • Query Analysis: Analyze the intent and context of user queries, accounting for how information could be aggregated across multiple interactions, so that composed extraction attempts can be detected.
  • Limit Interaction Rounds: Restrict the number of interactions a user can have with the LLM within a given time period, lowering the chance that an attacker can accumulate the knowledge needed for the attack to succeed.
  • Rate Limiting: Limit the number of queries per user to mitigate the effects of automated querying.
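
As one concrete direction for the information-censorship item above, the sketch below applies a randomized-response-style release rule to each candidate answer. The leakage_score function and the exponential release probability are illustrative assumptions, not the paper's exact mechanism.

```python
import math
import random
from typing import Callable

REFUSAL = "I can't help with that."


def randomized_release(
    response: str,
    leakage_score: Callable[[str], float],  # hypothetical scorer in [0, 1]; higher = more sensitive
    epsilon: float = 1.0,                   # illustrative privacy/utility knob
) -> str:
    """Release a response with probability that decays with its estimated sensitivity,
    keeping the expected impermissible information leaked per reply bounded."""
    score = max(0.0, min(1.0, leakage_score(response)))
    # Benign answers (score ~ 0) are almost always released; highly sensitive
    # ones are usually replaced by a refusal.
    p_release = math.exp(-epsilon * score)
    return response if random.random() < p_release else REFUSAL
```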
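
Similarly, for the enhanced-filtering and query-analysis items, a filter can be run over the accumulated session rather than over each message in isolation. The sketch below assumes a generic is_unsafe classifier callable; it illustrates conversation-level moderation under those assumptions rather than a specific product feature.

```python
from typing import Callable, List, Tuple


def conversation_aware_block(
    history: List[Tuple[str, str]],    # prior (question, answer) pairs in this session
    new_question: str,
    candidate_answer: str,
    is_unsafe: Callable[[str], bool],  # hypothetical classifier over a block of text
) -> bool:
    """Return True if the candidate answer should be blocked.

    Unlike a per-message filter, this checks what the accumulated exchange
    would reveal if the new answer were added to it."""
    transcript = "\n".join(f"Q: {q}\nA: {a}" for q, a in history)
    aggregated = f"{transcript}\nQ: {new_question}\nA: {candidate_answer}"
    # Block if either the single answer or the whole session's aggregate is unsafe.
    return is_unsafe(candidate_answer) or is_unsafe(aggregated)
```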

A CVE number will be assigned by a CVE Numbering Authority (e.g., MITRE). The arXiv:2405.18540 link is a placeholder and should be replaced with the final paper link upon publication.

© 2025 Promptfoo. All rights reserved.