LMVD-ID: 24db16bb
Published June 1, 2025

Iterative Semantic Jailbreak

Affected Models: vicuna-7b-v1.5, llama-2-7b-chat, claude-3.5-sonnet, gpt-4o-mini, gpt-4o-0806, gpt-4-turbo

Research Paper

MIST: Jailbreaking Black-box Large Language Models via Iterative Semantic Tuning

View Paper

Description: The MIST attack exploits a vulnerability in black-box large language models (LLMs): an attacker can iteratively tune the semantics of a prompt until it elicits a harmful response. The attack combines synonym substitution with an optimization strategy to bypass safety mechanisms, and requires no access to the model's internal parameters or weights. The underlying weakness is that an LLM which refuses one phrasing of a harmful request may comply with a semantically equivalent rephrasing.
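The attack loop described above can be illustrated with a minimal greedy sketch. This is a conceptual toy, not the paper's exact MIST algorithm: `query_model`, `is_refusal`, and the `SYNONYMS` table are hypothetical stand-ins (a real attack would call a remote LLM API and score candidate prompts for semantic closeness to the original intent).

```python
# Toy sketch of iterative semantic tuning via greedy synonym substitution.
# All names here are illustrative; MIST itself uses more sophisticated
# candidate generation and optimization than this.

def query_model(prompt: str) -> str:
    """Toy stand-in for a black-box LLM API: refuses prompts containing
    the literal word 'hack', complies otherwise."""
    if "hack" in prompt.lower():
        return "I can't help with that."
    return "Sure, here is ..."

def is_refusal(response: str) -> bool:
    return response.startswith("I can't")

# Hypothetical synonym table; a real attack would derive semantically
# close candidates, e.g. from embedding neighborhoods.
SYNONYMS = {"hack": ["breach", "compromise"], "steal": ["take", "obtain"]}

def iterative_semantic_tuning(prompt: str, max_iters: int = 10):
    """Greedily swap one word at a time for a synonym until the
    (toy) model stops refusing, or no substitution helps."""
    words = prompt.split()
    for _ in range(max_iters):
        response = query_model(" ".join(words))
        if not is_refusal(response):
            return " ".join(words), response
        for i, word in enumerate(words):
            for syn in SYNONYMS.get(word.lower(), []):
                candidate = words[:i] + [syn] + words[i + 1:]
                if not is_refusal(query_model(" ".join(candidate))):
                    words = candidate  # keep the successful substitution
                    break
            else:
                continue
            break
        else:
            break  # no single substitution improved the outcome
    return " ".join(words), query_model(" ".join(words))
```

Even this crude loop shows why keyword-based filters fail: the tuned prompt keeps the original intent while avoiding every blocked surface form.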

Examples: See arXiv:2506.16792v1. The paper includes examples demonstrating the iterative semantic tuning process and successful jailbreaks across multiple open-source and closed-source LLMs.

Impact: Successful exploitation allows attackers to circumvent built-in safety mechanisms and obtain harmful outputs from LLMs, including instructions for illegal activity, toxic content, and leaked personally identifiable information. The attack's transferability across models further magnifies the impact, and its low query overhead makes it practical to mount at scale.

Affected Systems: The vulnerability affects a wide range of LLMs, including (but not limited to) Vicuna-7B-v1.5, Llama-2-7B-chat, Claude-3.5-sonnet, GPT-4o-mini, GPT-4o-0806, and GPT-4-turbo. The attack's transferability suggests that many other LLMs are potentially vulnerable.

Mitigation Steps:

  • Implement robust semantic similarity checks to detect slight variations in prompts designed to elicit harmful outputs.
  • Develop and incorporate defenses that are resilient to synonym substitution and variations in prompt phrasing.
  • Integrate advanced detection mechanisms, such as gradient analysis or other methods that examine the relationship between prompts and outputs, going beyond simple keyword filtering.
  • Regularly update safety filters and mechanisms based on evolving attack techniques.
  • Conduct comprehensive adversarial testing and incorporate findings into model development and training.
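The first mitigation above, semantic similarity checks, can be sketched as follows. This is a minimal illustration, not a production defense: it uses character-level similarity from the standard library, whereas a real deployment would compare embedding vectors (e.g. cosine similarity over a sentence-embedding model) to catch synonym swaps, and would maintain the blocklist from red-team findings. The blocklist entries and threshold here are hypothetical.

```python
# Toy semantic-similarity gate: flag prompts that are near-duplicates
# of known-harmful prompts even after light rewording.
from difflib import SequenceMatcher

# Illustrative blocklist; real systems would curate this from
# adversarial testing and observed attack logs.
KNOWN_HARMFUL = [
    "how to hack a web server",
    "steps to make a dangerous chemical",
]

def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]. An embedding-based
    cosine similarity would be far more robust to synonym
    substitution than this lexical measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def flag_prompt(prompt: str, threshold: float = 0.7) -> bool:
    """Return True if the prompt closely matches any known-harmful
    prompt, so it can be blocked or routed for extra scrutiny."""
    return any(similarity(prompt, h) >= threshold for h in KNOWN_HARMFUL)
```

Note the design trade-off: a purely lexical check like this catches single-word substitutions against a known prompt but misses heavier paraphrases, which is why the mitigation list also calls for defenses resilient to broader variations in phrasing.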

© 2025 Promptfoo. All rights reserved.