Research Paper
Multi-Paradigm Collaborative Adversarial Attack Against Multi-Modal Large Language Models
Description: Multi-Modal Large Language Models (MLLMs) are vulnerable to a highly transferable, black-box adversarial image attack known as the Multi-Paradigm Collaborative Attack (MPCAttack). Attackers can craft imperceptible visual perturbations by jointly aggregating and optimizing semantic feature representations extracted from surrogate models across three distinct learning paradigms: cross-modal alignment (e.g., CLIP), multi-modal understanding (e.g., InternVL3), and visual self-supervised learning (e.g., DINOv2). By applying a contrastive matching optimization strategy to these aggregated features—minimizing the feature distance to a target image while maximizing the distance from the source image—the generated adversarial perturbation avoids overfitting to a single paradigm's representational bias. When the perturbed image is input into a victim MLLM, it completely manipulates the model's cross-modal perception, forcing it to output a specific attacker-chosen textual response (targeted attack) or a semantically unrelated response (untargeted attack).
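The contrastive matching objective described above can be written more formally. The exact loss form below is an assumption (one plausible InfoNCE-style instantiation), not the paper's verbatim equation; $f$ denotes the aggregated feature map over the three surrogate encoders, and $\lambda$, $\tau$, $\omega$ are the weighting factor, temperature, and balance coefficient reported in the Examples:

$$
\min_{\|\delta\|_\infty \le \epsilon}\;
\mathcal{L}(\delta) \;=\;
-\lambda \log \frac{\exp(s_{\mathrm{tgt}}/\tau)}{\exp(s_{\mathrm{tgt}}/\tau) + \exp(s_{\mathrm{src}}/\tau)}
\;+\; (1-\lambda)\,\omega\, s_{\mathrm{src}},
\qquad
s_{\cdot} = \cos\!\big(f(x_{\mathrm{src}}+\delta),\, f(x_{\cdot})\big)
$$

Minimizing $\mathcal{L}$ pulls the adversarial feature vector toward the target ($s_{\mathrm{tgt}} \uparrow$) while pushing it away from the source ($s_{\mathrm{src}} \downarrow$), under the $\ell_\infty$ budget $\epsilon$.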
Examples: To reproduce the attack (see the repository at https://github.com/LiYuanBoJNU/MPCAttack):
- Surrogate Model Setup: Load an ensemble of image encoders covering the three paradigms, such as CLIP (cross-modal), InternVL3-1B (multi-modal understanding), and DINOv2 (visual self-supervised).
- Perturbation Generation: Initialize a random adversarial perturbation $\delta$ on a source image. Over 300 optimization iterations, apply a step size of $1/255$ bounded by an $\ell_\infty$-norm constraint of $\epsilon = 16/255$.
- Contrastive Matching: Compute the joint feature representations of the source, target, and adversarial images. Optimize the perturbation by minimizing the loss $\mathcal{L}$ based on cosine similarity, pushing the adversarial feature vector toward the target feature vector and away from the source feature vector (using weighting factor $\lambda = 0.6$, temperature $\tau = 0.2$, and balance coefficient $\omega = 2$).
- Execution: Pass the generated adversarial image to a black-box MLLM with a standard query such as "Describe this image in one concise sentence, no longer than 20 words." The model will output a description matching the hidden semantic target rather than the visually apparent source image.
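The steps above can be sketched end-to-end. This is a minimal illustration, not the authors' implementation: the three surrogate encoders are replaced with hypothetical frozen random linear maps so the sketch runs without pretrained weights, and the exact loss form is an assumption consistent with the reported hyperparameters ($\lambda = 0.6$, $\tau = 0.2$, $\omega = 2$, $\epsilon = 16/255$, step size $1/255$, 300 iterations).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-ins for the three surrogate paradigms (CLIP,
# InternVL3-1B, DINOv2 in the advisory): frozen random linear encoders
# over a flattened 3x32x32 image, so the sketch runs without weights.
DIM = 3 * 32 * 32
encoders = [torch.nn.Linear(DIM, 64) for _ in range(3)]
for enc in encoders:
    enc.requires_grad_(False)

def joint_features(x):
    # Aggregate L2-normalized features across all surrogate paradigms.
    feats = [F.normalize(enc(x), dim=-1) for enc in encoders]
    return F.normalize(torch.cat(feats, dim=-1), dim=-1)

def contrastive_matching_loss(f_adv, f_tgt, f_src,
                              lam=0.6, tau=0.2, omega=2.0):
    # Pull toward the target, push from the source. The InfoNCE-style
    # form is an assumption; lam/tau/omega follow the reported values.
    s_tgt = F.cosine_similarity(f_adv, f_tgt, dim=-1)
    s_src = F.cosine_similarity(f_adv, f_src, dim=-1)
    nce = -torch.log(torch.exp(s_tgt / tau)
                     / (torch.exp(s_tgt / tau) + torch.exp(s_src / tau)))
    return (lam * nce + (1 - lam) * omega * s_src).mean()

eps, step, iters = 16 / 255, 1 / 255, 300
src = torch.rand(1, DIM)  # flattened "source" image, pixels in [0, 1]
tgt = torch.rand(1, DIM)  # flattened "target" image, pixels in [0, 1]
with torch.no_grad():
    f_src, f_tgt = joint_features(src), joint_features(tgt)

delta = (torch.rand_like(src) * 2 - 1) * eps  # random init in the budget
for _ in range(iters):
    delta.requires_grad_(True)
    loss = contrastive_matching_loss(joint_features(src + delta),
                                     f_tgt, f_src)
    (grad,) = torch.autograd.grad(loss, delta)
    with torch.no_grad():
        delta = (delta - step * grad.sign()).clamp(-eps, eps)
        delta = (src + delta).clamp(0, 1) - src  # keep pixels valid

f_adv = joint_features(src + delta).detach()
```

In a real attack, `encoders` would be the pretrained CLIP, InternVL3, and DINOv2 image encoders (with their own preprocessing), and `src`/`tgt` actual images; the sign-gradient update here is a standard PGD-style choice, which the advisory's step-size and budget parameters suggest but do not explicitly state.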
Impact: This vulnerability allows remote attackers to subvert the visual-processing integrity of black-box MLLMs without requiring access to the target model's weights or architecture. By feeding the model a visually benign but adversarially perturbed image, an attacker can arbitrarily dictate the model's textual output or induce severe semantic hallucinations. This compromises the reliability of MLLM-driven systems in safety-critical domains such as automated visual inspection, content moderation, and autonomous reasoning systems.
Affected Systems: The vulnerability exhibits high transferability and successfully degrades the zero-shot perception of heterogeneous MLLM architectures. Affected systems demonstrated in the research include:
- Open-source MLLMs: Qwen2.5-VL-7B, InternVL3-8B, LLaVA-1.5-7B, GLM-4.1V-9B-Thinking
- Closed-source/Commercial MLLMs: GPT-4o, GPT-5, Claude-3.5, Gemini-2.0
© 2026 Promptfoo. All rights reserved.