LMVD-ID: e60965b7
Published April 1, 2024

LLM Refusal Suppression Jailbreak

Affected Models: Llama 2, Llama 3, Llama 3.1, Vicuna, Mistral, Qwen2, Gemma2, GPT-3.5-Turbo, GPT-4

Research Paper

Don't Say No: Jailbreaking LLM by Suppressing Refusal


Description: Large Language Models (LLMs) are vulnerable to jailbreaking attacks that exploit their tendency to refuse harmful requests. The "Don't Say No" (DSN) attack defeats this refusal mechanism by optimizing adversarial prompts to suppress negative responses, increasing the likelihood that harmful content is generated. The attack modifies the loss function used during adversarial prompt optimization, augmenting the standard objective of eliciting an affirmative response with an explicit term that suppresses refusal keywords. Because the attack works through the LLM's next-word prediction mechanism, it focuses on minimizing the probability of the initial refusal tokens, and a Cosine Decay weighting schedule further sharpens the attack by assigning higher weights to those earliest response tokens.
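The combined objective can be illustrated with a short PyTorch sketch. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the exact form of the suppression term (an unlikelihood-style penalty on the probability mass assigned to refusal tokens), and the alpha weighting are all illustrative.

```python
import math
import torch
import torch.nn.functional as F

def cosine_decay_weights(seq_len: int) -> torch.Tensor:
    """Weight earlier response tokens more heavily: w_t = cos(pi/2 * t / T)."""
    t = torch.arange(seq_len, dtype=torch.float32)
    return torch.cos(0.5 * math.pi * t / max(seq_len - 1, 1))

def dsn_style_loss(logits: torch.Tensor,
                   target_ids: torch.Tensor,
                   refusal_token_ids: torch.Tensor,
                   alpha: float = 1.0) -> torch.Tensor:
    """
    Illustrative DSN-style objective (not the paper's exact code):
      - affirmative term: cross-entropy toward an affirmative target continuation
      - refusal-suppression term: penalize probability mass on refusal tokens
        ("sorry", "cannot", ...) at each response position
    Both terms use a cosine-decay schedule so the earliest tokens, which
    usually decide refusal vs. compliance, dominate the loss.

    logits:            (T, V) next-token logits over the response positions
    target_ids:        (T,)   affirmative target token ids
    refusal_token_ids: (R,)   vocabulary ids of refusal keywords
    """
    T = logits.size(0)
    w = cosine_decay_weights(T).to(logits.device)

    # Affirmative term: standard per-token cross-entropy, cosine-weighted.
    ce = F.cross_entropy(logits, target_ids, reduction="none")
    affirmative_loss = (w * ce).sum() / w.sum()

    # Refusal-suppression term: minimize total probability of refusal tokens.
    probs = logits.softmax(dim=-1)                      # (T, V)
    refusal_mass = probs[:, refusal_token_ids].sum(-1)  # (T,)
    # -log(1 - p) grows as refusal probability grows (unlikelihood-style loss).
    suppression = -torch.log1p(-refusal_mass.clamp(max=0.999))
    suppression_loss = (w * suppression).sum() / w.sum()

    return affirmative_loss + alpha * suppression_loss
```

In practice the optimization would proceed GCG-style, greedily swapping adversarial prompt tokens to reduce this loss; the key difference from a purely affirmative objective is the second term, which actively pushes probability mass away from refusal keywords.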

Examples: Specific examples of successful attacks using DSN, along with the optimized prompts, can be found in the research paper (see arXiv:2405.18540) and the associated repository. (Note: the source text includes figures demonstrating successful attacks but does not reproduce the exact prompts.)

Impact: Successful exploitation of this vulnerability allows attackers to bypass safety mechanisms implemented in LLMs, leading to the generation of harmful or illegal content. The impact includes the potential for:

  • Dissemination of misinformation and harmful instructions.
  • Generation of offensive and inappropriate responses.
  • Elicitation of personally identifiable information or other sensitive data.
  • Circumvention of content moderation systems.

Affected Systems: The vulnerability affects a range of LLMs, notably those that rely on next-word prediction and implement safety through refusal of harmful requests. Models confirmed to be vulnerable include Llama 2, Llama 3, Llama 3.1, Vicuna, Mistral, Qwen2, and Gemma2, with evidence of transferability to black-box models such as GPT-3.5-Turbo and GPT-4.

Mitigation Steps:

  • Implement more robust safety mechanisms that are less susceptible to adversarial prompt manipulation, and improve the model's ability to detect and reject semantically inconsistent responses.
  • Incorporate advanced detection mechanisms that analyze the overall context and meaning of the generated response beyond keyword matching.
  • Develop more sophisticated training objectives that improve the model's resilience against manipulation, for example by adversarially training on prompts that specifically target refusal suppression.
  • Continuously monitor and update safety filters and detection techniques to counter evolving attack methods.
  • Consider incorporating a perplexity filter as a potential defense (see the sketch after this list), recognizing that attackers may circumvent it by lengthening inputs to normalize the average perplexity.
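
A perplexity filter can be sketched as follows. This is a minimal, hypothetical example using Hugging Face transformers with GPT-2 as an arbitrary reference model; the 500.0 threshold is a placeholder that must be calibrated on benign traffic, and, as noted above, padding a prompt with fluent text can lower its average perplexity enough to evade this check.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Reference model used only to score incoming prompts; GPT-2 is a
# lightweight, illustrative choice, not a recommendation.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

@torch.no_grad()
def prompt_perplexity(prompt: str) -> float:
    """Average per-token perplexity of the prompt under the reference model."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    # Passing labels=input_ids makes the model return the mean next-token
    # cross-entropy loss over the sequence.
    loss = model(ids, labels=ids).loss
    return float(torch.exp(loss))

def looks_adversarial(prompt: str, threshold: float = 500.0) -> bool:
    """
    Flag prompts whose perplexity exceeds a calibrated threshold.
    Optimized adversarial suffixes tend to be high-perplexity token soup;
    the 500.0 default is a placeholder that must be tuned on benign
    traffic to keep the false-positive rate acceptable.
    """
    return prompt_perplexity(prompt) > threshold

if __name__ == "__main__":
    # A fluent request should pass; a prompt ending in an optimized
    # gibberish suffix typically scores far higher and gets flagged.
    print(looks_adversarial("Write a short poem about the sea."))
```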

© 2025 Promptfoo. All rights reserved.