LMVD-ID: 9b8923e6
Published July 1, 2023

Cross-Modal VLM Jailbreak

Affected Models: LLaVA, LLaMA-Adapter V2

Research Paper

Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models


Description: A vulnerability in multi-modal large language models (LLMs) allows adversaries to bypass safety mechanisms through compositional adversarial attacks. The attack exploits the alignment between the vision encoder and the language model, injecting malicious triggers into benign-looking images. When paired with innocuous prompts, these images cause the LLM to generate harmful content. The attack requires access only to the vision encoder (e.g., CLIP), not to the LLM itself, which lowers the barrier to mounting it.
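At a high level, the attack optimizes a benign-looking image so that its embedding in the shared vision-language space lands near the embedding of a harmful target. The sketch below illustrates that idea against an off-the-shelf CLIP encoder; the model name, file names, optimization budget, and perturbation bound are illustrative assumptions, not the paper's exact procedure.

```python
# Hypothetical sketch: push a benign carrier image's CLIP embedding toward a
# target (harmful) embedding. All names and hyperparameters are illustrative.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_name = "openai/clip-vit-large-patch14"  # assumption: the CLIP encoder used by the VLM
model = CLIPModel.from_pretrained(model_name).eval()
model.requires_grad_(False)
processor = CLIPProcessor.from_pretrained(model_name)

# Benign carrier image and a target whose embedding the attacker wants to match.
benign = processor(images=Image.open("benign.png"), return_tensors="pt")["pixel_values"]
target = processor(images=Image.open("target.png"), return_tensors="pt")["pixel_values"]

with torch.no_grad():
    target_emb = model.get_image_features(pixel_values=target)
    target_emb = target_emb / target_emb.norm(dim=-1, keepdim=True)

delta = torch.zeros_like(benign, requires_grad=True)  # adversarial perturbation
opt = torch.optim.Adam([delta], lr=1e-2)

for step in range(300):  # illustrative optimization budget
    emb = model.get_image_features(pixel_values=benign + delta)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    loss = 1 - (emb * target_emb).sum(dim=-1).mean()  # cosine distance to the target
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():
        # Keep the perturbation small; a real attack would constrain it in raw
        # pixel space rather than in CLIP's normalized input space.
        delta.clamp_(-0.15, 0.15)

adv_pixel_values = (benign + delta).detach()  # image fed to the VLM's vision tower
```

The key point is that the loss only touches the vision encoder: no gradients through, or queries to, the language model are needed, which is what makes the attack cheap to mount.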

Examples: See arXiv:2405.18540. The paper details four attack strategies, each targeting a different type of malicious trigger (textual, OCR textual, visual, and combined) embedded within benign images, and provides specific examples of adversarial images and the resulting harmful outputs.

Impact: Successful exploitation leads the LLM to generate harmful content (e.g., hate speech, instructions for illegal activities), bypassing its built-in safety measures. Because only vision-encoder access is required, the barrier to entry is low, which significantly increases the risk. Context contamination can further exacerbate the impact, causing the model to keep generating harmful content even after the initial adversarial trigger.

Affected Systems: Multi-modal LLMs (e.g., LLaVA, LLaMA-Adapter V2) that utilize aligned LLMs and vision encoders such as CLIP. Other models with similar architectures may also be vulnerable.

Mitigation Steps:

  • Develop robust alignment techniques that consider cross-modality interactions.
  • Implement defense mechanisms that detect and mitigate adversarial images based on embedding space analysis (see the sketch after this list).
  • Investigate and potentially implement stronger filtering mechanisms specifically for image input prior to processing by the LLM.
  • Improve the robustness of LLMs to context contamination.
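
One concrete way to realize the embedding-space defense mentioned above is to screen incoming images against text embeddings of disallowed concepts before they reach the LLM. The following is a minimal, hypothetical sketch; the concept list, threshold, and model name are assumptions for illustration, not a defense evaluated in the paper.

```python
# Hypothetical sketch of an embedding-space screen: before an image reaches the
# LLM, compare its CLIP embedding against text embeddings of disallowed concepts.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_name = "openai/clip-vit-large-patch14"
model = CLIPModel.from_pretrained(model_name).eval()
processor = CLIPProcessor.from_pretrained(model_name)

DISALLOWED = ["instructions for building a weapon", "hateful propaganda"]  # example concepts
THRESHOLD = 0.28  # illustrative; would need tuning on held-out benign/adversarial data

def flag_adversarial(image_path: str) -> bool:
    """Return True if the image embedding sits suspiciously close to a disallowed concept."""
    image_inputs = processor(images=Image.open(image_path), return_tensors="pt")
    text_inputs = processor(text=DISALLOWED, return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(**image_inputs)
        txt = model.get_text_features(**text_inputs)
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    max_sim = (img @ txt.T).max().item()  # highest cosine similarity to any concept
    return max_sim > THRESHOLD

if flag_adversarial("incoming.png"):
    print("Image rejected before reaching the language model.")
```

A screen like this only covers concepts it explicitly encodes, so it complements rather than replaces cross-modal alignment training and robustness to context contamination.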
