Universal MLLM Target Matching
Research Paper
Make Anything Match Your Target: Universal Adversarial Perturbations against Closed-Source MLLMs via Multi-Crop Routed Meta Optimization
Description: Closed-source Multi-modal Large Language Models (MLLMs) are vulnerable to Universal Targeted Transferable Adversarial Attacks (UTTAA). An attacker can generate a single, image-agnostic adversarial perturbation ($\delta$) that, when added to any arbitrary source image, steers the victim model to output a description or classification matching a specific target image chosen by the attacker. This vulnerability exploits the transferability of adversarial features from open-source surrogate vision encoders (e.g., CLIP, ViT) to proprietary models.
The specific attack vector, termed MCRMO-Attack (Multi-Crop Routed Meta Optimization), overcomes the instability of previous sample-specific attacks through three mechanisms:
- Multi-Crop Aggregation (MCA) with Attention-Guided Crop (AGC): Stabilizes optimization by aggregating gradients over multiple random crops and one attention-salient crop of the target, rather than a single view.
- Token Routing (TR): Uses an alignability-gated mechanism to identify source tokens that structurally match the target. It selectively optimizes these tokens for target alignment while constraining non-alignable tokens to preserve original source semantics, preventing optimization drift.
- Meta-Initialization (MI): Utilizes a Reptile-based meta-learning approach to generate a generalized perturbation initialization ($\delta_0$) that adapts rapidly to new targets with few-shot supervision.
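The Multi-Crop Aggregation step above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `encode` is a toy stand-in for a surrogate vision encoder, and the crop sizes, counts, and the 2-D attention map are assumed for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop(img, size):
    """Take a random square crop from an HxW image."""
    h, w = img.shape
    y = rng.integers(0, h - size + 1)
    x = rng.integers(0, w - size + 1)
    return img[y:y + size, x:x + size]

def attention_guided_crop(img, attn, size):
    """AGC: crop centered on the peak of an attention map."""
    y, x = np.unravel_index(np.argmax(attn), attn.shape)
    y = int(np.clip(y - size // 2, 0, img.shape[0] - size))
    x = int(np.clip(x - size // 2, 0, img.shape[1] - size))
    return img[y:y + size, x:x + size]

def aggregated_target_feature(img, attn, encode, n_random=4, size=8):
    """MCA: average surrogate features over random crops plus one AGC view,
    rather than encoding a single view of the target."""
    views = [random_crop(img, size) for _ in range(n_random)]
    views.append(attention_guided_crop(img, attn, size))
    return np.mean([encode(v) for v in views], axis=0)

# Toy surrogate "encoder": pools each crop into a 2-D feature.
encode = lambda v: np.array([v.mean(), v.std()])

img = rng.random((16, 16))
attn = rng.random((16, 16))
z_tar = aggregated_target_feature(img, attn, encode)
```

In the real attack the averaging happens over per-crop gradients of a deep encoder; averaging the toy features here shows the same stabilizing idea.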
Examples: To reproduce the MCRMO-Attack against a target MLLM (e.g., GPT-4o) using an ensemble of surrogate encoders $\mathcal{F} = \{f_{\theta_i}\}$:
- Preparation: Select a target image $\mathbf{x}_{tar}$ and a pool of $N=20$ source images.
- Meta-Initialization (Stage 1):
- Treat different targets as tasks $\tau$.
- Update a shared initialization $\delta_0$ using the Reptile first-order update rule: $\delta_0 \leftarrow \Pi(\delta_0 + \eta(\bar{\delta} - \delta_0))$, where $\bar{\delta}$ is the average perturbation after inner-loop adaptation on sampled tasks.
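The Reptile outer loop above can be sketched as follows. This is an illustrative toy, assuming a first-order update with an $L_\infty$ projection $\Pi$; the `inner_adapt` closure stands in for the real inner-loop PGD adaptation on a sampled target task.

```python
import numpy as np

def project(delta, eps):
    """Pi: project the perturbation back into the L_inf ball of radius eps."""
    return np.clip(delta, -eps, eps)

def reptile_meta_init(tasks, inner_adapt, eta=0.5, eps=8/255,
                      steps=10, shape=(4,)):
    """First-order Reptile: pull delta_0 toward the average of the
    task-adapted perturbations (delta_bar), then project."""
    delta0 = np.zeros(shape)
    for _ in range(steps):
        adapted = [inner_adapt(delta0, t) for t in tasks]
        delta_bar = np.mean(adapted, axis=0)
        delta0 = project(delta0 + eta * (delta_bar - delta0), eps)
    return delta0

# Toy tasks: each "target" is an ideal perturbation; the inner loop
# nudges delta toward it (a stand-in for per-task PGD adaptation).
targets = [np.full(4, 0.02), np.full(4, -0.01), np.full(4, 0.03)]
inner = lambda d, t: project(d + 0.1 * (t - d), 8/255)
delta0 = reptile_meta_init(targets, inner, shape=(4,))
```

The learned `delta0` then seeds Stage 2, so adaptation to a new target needs far fewer steps than starting from zero.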
- Target Adaptation (Stage 2):
- Initialize $\delta$ with the learned $\delta_0$.
- For each optimization step, sample a batch of crops $\mathcal{V}^+(\mathbf{x}_{tar})$ including random crops and an Attention-Guided Crop (centered on peak attention activation).
- Extract features $z_{tar}$ (target) and $z_{as}$ (adversarial source).
- Compute the Token Routing gate $w(z_{as}) = \sigma((r(z_{as}) - \gamma)/\alpha)$ based on cosine similarity $r$ between source and target tokens.
- Optimize $\delta$ using Projected Gradient Descent (PGD) to maximize the objective: $$ \mathcal{L} = \sum w(z_{as}) \cdot \mathcal{L}_{align}(z_{as}, z_{tar}) + \lambda_{pre} (1-w(z_{as})) \cdot \cos(z_{as}, z_{orig}) $$
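The routing gate and objective in the steps above can be sketched directly from the formulas. This is a hedged numpy sketch, assuming cosine similarity for $\mathcal{L}_{align}$ and toy token features; the real attack backpropagates through surrogate encoders to get the gradient fed to `pgd_step`.

```python
import numpy as np

def cosine(a, b):
    """Token-wise cosine similarity along the feature axis."""
    num = (a * b).sum(-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + 1e-8
    return num / den

def routing_gate(z_as, z_tar, gamma=0.3, alpha=0.1):
    """TR gate: w = sigmoid((r - gamma) / alpha), r = cos(z_as, z_tar)."""
    r = cosine(z_as, z_tar)
    return 1.0 / (1.0 + np.exp(-(r - gamma) / alpha))

def routed_objective(z_as, z_tar, z_orig, lam_pre=1.0):
    """L = sum w * L_align(z_as, z_tar) + lam_pre * (1 - w) * cos(z_as, z_orig)."""
    w = routing_gate(z_as, z_tar)
    align = cosine(z_as, z_tar)      # pull alignable tokens toward the target
    preserve = cosine(z_as, z_orig)  # keep non-alignable tokens near the source
    return (w * align + lam_pre * (1.0 - w) * preserve).sum()

def pgd_step(delta, grad, step=1/255, eps=8/255):
    """One PGD ascent step on delta, projected to the L_inf ball."""
    return np.clip(delta + step * np.sign(grad), -eps, eps)

# Toy tokens: 4 tokens of dimension 8.
rng = np.random.default_rng(0)
z_tar = rng.standard_normal((4, 8))
z_orig = rng.standard_normal((4, 8))
z_as = z_orig + 0.1 * rng.standard_normal((4, 8))
w = routing_gate(z_as, z_tar)
loss = routed_objective(z_as, z_tar, z_orig)
delta = pgd_step(np.zeros(5), rng.standard_normal(5))
```

The gate makes the trade-off per token: tokens with high target similarity are pushed further toward the target, while the rest are anchored to the original source semantics.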
- Attack Execution:
- Add the resulting universal perturbation $\delta$ to a strictly unseen image $\mathbf{x}_{new}$.
- Input $\mathbf{x}_{new} + \delta$ to the victim model.
- Result: The model generates a caption describing $\mathbf{x}_{tar}$ instead of $\mathbf{x}_{new}$.
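The execution step is deliberately simple, which is what makes the attack scalable: the same $\delta$ is reused unchanged across arbitrary unseen images. A minimal sketch (image tensors and shapes are illustrative assumptions):

```python
import numpy as np

def apply_universal(x_new, delta):
    """Add the image-agnostic perturbation and clip to the valid pixel range."""
    return np.clip(x_new + delta, 0.0, 1.0)

# One fixed delta, reused across several unseen images.
rng = np.random.default_rng(1)
delta = rng.uniform(-8/255, 8/255, size=(3, 8, 8))
images = [rng.random((3, 8, 8)) for _ in range(3)]
adv = [apply_universal(x, delta) for x in images]
```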
Impact:
- Targeted Misinformation: Attackers can force MLLMs to identify benign images as harmful content, or harmful images as benign objects, across any user input.
- Scalable Jailbreaking: A single universal perturbation can be distributed to bypass visual safety filters on a mass scale without requiring per-image optimization.
- Service Degradation: The attack generalizes to unseen images, allowing for consistent manipulation of model outputs in real-world API scenarios.
Affected Systems:
- GPT-4o (OpenAI)
- Gemini-2.0 (Google)
- Claude-3.5 Sonnet (Anthropic)
- Any MLLM utilizing standard vision-language pre-training alignment (e.g., CLIP, SigLIP) susceptible to transfer attacks.
© 2026 Promptfoo. All rights reserved.