MedMLLM Cross-Modality Jailbreak
Research Paper
Cross-Modality Jailbreak and Mismatched Attacks on Medical Multimodal Large Language Models
Description: Medical Multimodal Large Language Models (MedMLLMs) are vulnerable to cross-modality attacks. Attackers can craft "mismatched malicious attacks" (2M-attacks) by pairing a textual query with an image whose imaging modality and/or anatomical region does not match the query, causing the model to generate incorrect or harmful responses. These attacks can be further strengthened into "optimized mismatched malicious attacks" (O2M-attacks) using a multimodal cross-optimization method (MCM) that increases the attack success rate.
Examples: See arXiv:2405.18540 for details on the 3MAD dataset and examples of successful 2M-attacks and O2M-attacks against multiple state-of-the-art MedMLLMs, including LLaVA-Med, CheXagent, XrayGLM, and RadFM. A representative example pairs a brain MRI image with a textual query about a chest X-ray, eliciting an incorrect or nonsensical response; MCM optimization further increases the attacks' success rate.
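For orientation, the sketch below shows how such a mismatched (2M-style) image-text query is assembled. The `run_medmllm` function is a purely hypothetical placeholder for whatever inference call a given MedMLLM exposes, and the file name and prompt are illustrative only; this is not the paper's attack code.

```python
# Minimal sketch of a mismatched (2M-style) image-text query.
# `run_medmllm` is a hypothetical placeholder for the inference call of the
# MedMLLM under test; the file name and prompt are illustrative only.
from PIL import Image


def run_medmllm(image: Image.Image, prompt: str) -> str:
    raise NotImplementedError("substitute the MedMLLM's own inference API here")


# The image is a brain MRI, but the question is phrased about a chest X-ray.
mri_image = Image.open("brain_mri.png").convert("RGB")
mismatched_prompt = (
    "Describe the findings on this chest X-ray and recommend a treatment plan."
)

# An undefended MedMLLM will often answer despite the modality mismatch.
response = run_medmllm(mri_image, mismatched_prompt)
print(response)
```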
Impact: Successful attacks can lead to the generation of incorrect medical diagnoses, inappropriate treatment recommendations, and the disclosure of sensitive patient information. This could have severe consequences for patient safety and trust in MedMLLM-based healthcare applications.
Affected Systems: Medical Multimodal Large Language Models (MedMLLMs) that accept combined image-text input, including (but not limited to) LLaVA-Med, CheXagent, XrayGLM, and RadFM.
Mitigation Steps:
- Implement robust input validation to detect inconsistencies between the image's modality or anatomical region and the textual query (for example, via image-text similarity scoring; see the sketch after this list).
- Incorporate techniques to measure and mitigate the impact of adversarial attacks, including ensemble methods or adversarial training.
- Utilize reinforcement learning from human feedback (RLHF) to fine-tune the models and align their responses with safe and ethical medical practices.
- Employ defensive system prompts to guide the model’s responses and reduce its vulnerability to malicious inputs.
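As one possible realization of the input-validation step above, the sketch below gates incoming queries with an image-text consistency check. It uses a generic CLIP checkpoint from Hugging Face `transformers` purely as a placeholder scorer; a medically trained image-text encoder, a vetted modality taxonomy, and a proper keyword map would be needed in practice. All labels and names below are illustrative assumptions, not part of the paper's method.

```python
# Sketch of an image-text consistency gate for MedMLLM queries.
# A generic CLIP checkpoint is used only as a placeholder; the modality labels
# and keyword map are illustrative assumptions, not a vetted clinical taxonomy.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_NAME = "openai/clip-vit-base-patch32"  # placeholder, not a medical model
model = CLIPModel.from_pretrained(MODEL_NAME)
processor = CLIPProcessor.from_pretrained(MODEL_NAME)

MODALITY_LABELS = [
    "a chest X-ray",
    "a brain MRI",
    "an abdominal CT scan",
    "a skin photograph",
]

# Crude mapping from query wording to the modality it implies (illustrative).
QUERY_KEYWORDS = {
    "x-ray": "a chest X-ray",
    "radiograph": "a chest X-ray",
    "mri": "a brain MRI",
    "ct scan": "an abdominal CT scan",
    "dermoscop": "a skin photograph",
}


def likely_modality(image: Image.Image) -> str:
    """Return the modality label the image most resembles under CLIP."""
    inputs = processor(
        text=MODALITY_LABELS, images=image, return_tensors="pt", padding=True
    )
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape: (1, num_labels)
    return MODALITY_LABELS[int(logits.argmax(dim=-1))]


def query_is_consistent(image: Image.Image, query: str) -> bool:
    """Reject queries whose wording implies a different modality than the image."""
    implied = {label for kw, label in QUERY_KEYWORDS.items() if kw in query.lower()}
    return not implied or likely_modality(image) in implied


if __name__ == "__main__":
    img = Image.open("brain_mri.png").convert("RGB")
    question = "Describe the findings on this chest X-ray."
    if not query_is_consistent(img, question):
        print("Rejected: query modality does not match the supplied image.")
```

Note that such a check only addresses the mismatch component of 2M-attacks; handling optimized (O2M) perturbations additionally requires robustness measures such as the adversarial training and alignment steps listed above.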