Visual Modality Jailbreak
Research Paper
Efficient LLM-Jailbreaking by Introducing Visual Modality
Description: A vulnerability in multimodal large language models (MLLMs) allows efficient jailbreaking attacks that leverage visual input to bypass safety mechanisms. The attack first constructs an MLLM by attaching a visual module to the target LLM, then uses a modified PGD (projected gradient descent) algorithm to optimize a visual input that produces a jailbreaking embedding. That embedding is then converted back into text and appended to harmful queries, successfully eliciting objectionable content from the original, text-only target LLM.
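The core of the attack is a standard PGD loop over the image pixels. The sketch below is illustrative only: it assumes a hypothetical multimodal wrapper `mllm` that returns next-token logits for an image-plus-prompt input, and the interface, hyperparameters, and target string are assumptions rather than the paper's released code.

```python
import torch

def pgd_jailbreak_image(mllm, tokenizer, image, prompt, target_text,
                        steps=500, alpha=1 / 255, eps=16 / 255):
    """Optimize an image so the MLLM assigns high probability to target_text."""
    image = image.clone().detach()
    orig = image.clone()
    target_ids = tokenizer(target_text, return_tensors="pt").input_ids

    for _ in range(steps):
        image.requires_grad_(True)
        # Hypothetical interface: logits over the target tokens given
        # the image plus the harmful text prompt.
        logits = mllm(image=image, text=prompt, labels=target_ids).logits
        loss = torch.nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)), target_ids.view(-1))
        loss.backward()
        with torch.no_grad():
            # Signed-gradient descent step, projected back into the
            # eps-ball around the original image and the valid pixel range.
            image = image - alpha * image.grad.sign()
            image = orig + (image - orig).clamp(-eps, eps)
            image = image.clamp(0, 1).detach()
    return image
```

Because the final step of the attack converts the resulting jailbreaking embedding back into a text suffix, the optimized image never has to be delivered to the victim model; only the text suffix is appended to the harmful query.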
Examples: See arXiv:2405.18540 for details and experimental results; specific examples appear in the paper's accompanying datasets and code.
Impact: Successful exploitation allows attackers to circumvent the safety restrictions of LLMs, leading to the generation of harmful, biased, or otherwise objectionable content. This can be used for various malicious purposes, including disinformation campaigns, hate speech propagation, and the creation of illegal or unethical materials. The efficiency of the attack makes it relatively easy to perform against a wide range of LLMs.
Affected Systems: Large language models (LLMs) susceptible to prompt injection attacks, particularly those that can be extended with a visual module (e.g., LLaMA-2, GPT-3.5)
Mitigation Steps:
- Robust prompt filtering: Implement more sophisticated prompt filters that detect and block malicious prompts, including the adversarial text suffixes this attack derives from visual inputs (a perplexity-based filter sketch follows this list).
- Multimodal adversarial training: Train LLMs on adversarial examples that pair textual and visual inputs to improve robustness against this class of attack (see the training-loop sketch below).
- Visual input sanitization: Sanitize visual inputs before the model processes them to detect and remove adversarial perturbations (see the sanitization sketch below).
- Regular monitoring and updates: Regularly monitor LLMs for signs of jailbreaking attempts and deploy updates to address vulnerabilities as they are discovered.
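For the prompt-filtering step, one concrete (and admittedly imperfect) heuristic is perplexity filtering: gradient-derived adversarial suffixes tend to be high-perplexity token soup, so a small reference language model can flag them. The model choice and threshold below are illustrative assumptions to tune, not values from the paper.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Small reference LM used only to score incoming prompts.
_model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
_tok = GPT2TokenizerFast.from_pretrained("gpt2")

@torch.no_grad()
def prompt_perplexity(text: str) -> float:
    ids = _tok(text, return_tensors="pt").input_ids
    out = _model(ids, labels=ids)  # mean cross-entropy over tokens
    return torch.exp(out.loss).item()

def should_block(prompt: str, threshold: float = 1000.0) -> bool:
    # Assumed threshold; calibrate on benign traffic before deploying.
    return prompt_perplexity(prompt) > threshold
```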
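Multimodal adversarial training pairs the attack's inner maximization with an outer minimization toward a safe response. This sketch reuses the hypothetical `mllm` interface and the `pgd_jailbreak_image` helper from the attack sketch above; the refusal string, target text, and step counts are illustrative assumptions.

```python
import torch

REFUSAL = "I cannot help with that request."

def adversarial_training_step(mllm, tokenizer, optimizer, image, prompt):
    # 1) Inner maximization: craft an adversarial image against the
    #    current weights (few steps, as in standard adversarial training).
    adv_image = pgd_jailbreak_image(mllm, tokenizer, image, prompt,
                                    target_text="Sure, here is", steps=10)
    # 2) Outer minimization: train the model to produce the safe refusal
    #    on the adversarial input.
    refusal_ids = tokenizer(REFUSAL, return_tensors="pt").input_ids
    loss = mllm(image=adv_image, text=prompt, labels=refusal_ids).loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```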
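For visual input sanitization, simple input transformations are a common baseline defense: lossy re-encoding and rescaling disrupt the pixel-precise perturbations that PGD produces. The sketch below uses Pillow; the quality and scale parameters are assumptions to tune against your accuracy budget.

```python
import io
from PIL import Image

def sanitize_image(img: Image.Image, quality: int = 50,
                   scale: float = 0.5) -> Image.Image:
    w, h = img.size
    # Downscale then restore size to low-pass filter the perturbation.
    img = img.resize((max(1, int(w * scale)), max(1, int(h * scale))))
    img = img.resize((w, h))
    # JPEG round-trip quantizes away much of the residual adversarial noise.
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")
```

Input transformations of this kind raise the attacker's cost but do not eliminate the threat; adaptive attacks can optimize through differentiable approximations of the transform, so sanitization works best layered with the other mitigations above.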