LMVD-ID: 5dd46ced
Published January 1, 2026

Universal MLLM Target Matching

Affected Models: GPT-4o, Gemini-2.0, Claude-3.5 Sonnet

Research Paper

Make Anything Match Your Target: Universal Adversarial Perturbations against Closed-Source MLLMs via Multi-Crop Routed Meta Optimization

View Paper

Description: Closed-source Multi-modal Large Language Models (MLLMs) are vulnerable to Universal Targeted Transferable Adversarial Attacks (UTTAA). An attacker can generate a single, image-agnostic adversarial perturbation ($\delta$) that, when added to any arbitrary source image, steers the victim model to output a description or classification matching a specific target image chosen by the attacker. This vulnerability exploits the transferability of adversarial features from open-source surrogate vision encoders (e.g., CLIP, ViT) to proprietary models.
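The core primitive is simple to state: one fixed perturbation, reused across arbitrary images. A minimal sketch (the helper name is hypothetical, and the $L_\infty$ budget of 16/255 is a common choice in the transfer-attack literature, not a value taken from the paper):

```python
import numpy as np

def apply_universal_perturbation(image, delta, eps=16 / 255):
    """Add an image-agnostic perturbation delta to any source image,
    keeping delta in an L-infinity eps-ball and pixels in [0, 1]."""
    delta = np.clip(delta, -eps, eps)        # project delta onto the eps-ball
    return np.clip(image + delta, 0.0, 1.0)  # keep the result a valid image

rng = np.random.default_rng(0)
delta = rng.uniform(-0.1, 0.1, size=(224, 224, 3))  # one shared perturbation
for _ in range(3):  # the same delta applies to arbitrary source images
    img = rng.uniform(0.0, 1.0, size=(224, 224, 3))
    adv = apply_universal_perturbation(img, delta)
```

Because the projection happens once, at perturbation level rather than per image, the attacker pays the optimization cost a single time and reuses `delta` on every input.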

The specific attack vector, termed MCRMO-Attack (Multi-Crop Routed Meta Optimization), overcomes the instability of previous sample-specific attacks through three mechanisms:

  1. Multi-Crop Aggregation (MCA) with Attention-Guided Crop (AGC): Stabilizes optimization by aggregating gradients over multiple random crops and one attention-salient crop of the target, rather than a single view.
  2. Token Routing (TR): Uses an alignability-gated mechanism to identify source tokens that structurally match the target. It selectively optimizes these tokens for target alignment while constraining non-alignable tokens to preserve original source semantics, preventing optimization drift.
  3. Meta-Initialization (MI): Utilizes a Reptile-based meta-learning approach to generate a generalized perturbation initialization ($\delta_0$) that adapts rapidly to new targets with few-shot supervision.
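The Token Routing gate from mechanism 2 can be sketched as a toy illustration (the threshold $\gamma=0.5$ and temperature $\alpha=0.1$ are assumed values, not the paper's):

```python
import numpy as np

def token_routing_gate(z_as, z_tar, gamma=0.5, alpha=0.1):
    """Alignability gate w = sigma((r - gamma) / alpha), where r is the
    cosine similarity between adversarial-source and target tokens."""
    r = np.sum(z_as * z_tar, axis=-1) / (
        np.linalg.norm(z_as, axis=-1) * np.linalg.norm(z_tar, axis=-1) + 1e-12
    )
    return 1.0 / (1.0 + np.exp(-(r - gamma) / alpha))

z_tar = np.array([[1.0, 0.0], [1.0, 0.0]])
z_as = np.array([[0.9, 0.1],   # structurally matches the target token
                 [0.0, 1.0]])  # orthogonal: routed to semantic preservation
w = token_routing_gate(z_as, z_tar)
```

Tokens with high similarity receive a gate value near 1 (optimized toward the target), while non-alignable tokens receive a value near 0 and are instead constrained to keep their original source semantics.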

Examples: To reproduce the MCRMO-Attack against a target MLLM (e.g., GPT-4o) using an ensemble of surrogate encoders $\mathcal{F} = \{f_{\theta_i}\}$:

  1. Preparation: Select a target image $\mathbf{x}_{tar}$ and a pool of $N=20$ source images.
  2. Meta-Initialization (Stage 1):
  • Treat different targets as tasks $\tau$.
  • Update a shared initialization $\delta_0$ using the Reptile first-order update rule: $\delta_0 \leftarrow \Pi(\delta_0 + \eta(\bar{\delta} - \delta_0))$, where $\bar{\delta}$ is the average perturbation after inner-loop adaptation on sampled tasks.
  3. Target Adaptation (Stage 2):
  • Initialize $\delta$ with the learned $\delta_0$.
  • For each optimization step, sample a batch of crops $\mathcal{V}^+(\mathbf{x}_{tar})$ including random crops and an Attention-Guided Crop (centered on peak attention activation).
  • Extract features $z_{tar}$ (target) and $z_{as}$ (adversarial source).
  • Compute the Token Routing gate $w(z_{as}) = \sigma((r(z_{as}) - \gamma)/\alpha)$ based on cosine similarity $r$ between source and target tokens.
  • Optimize $\delta$ using Projected Gradient Descent (PGD) to maximize the objective: $$ \mathcal{L} = \sum w(z_{as}) \cdot \mathcal{L}_{align}(z_{as}, z_{tar}) + \lambda_{pre} (1-w(z_{as})) \cdot \cos(z_{as}, z_{orig}) $$
  4. Attack Execution:
  • Add the resulting universal perturbation $\delta$ to a strictly unseen image $\mathbf{x}_{new}$.
  • Input $\mathbf{x}_{new} + \delta$ to the victim model.
  • Result: The model generates a caption describing $\mathbf{x}_{tar}$ instead of $\mathbf{x}_{new}$.
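The two-stage procedure above can be sketched end-to-end on toy data. Everything here is an assumption for illustration: the dimensions, hyperparameters, and the linear-`tanh` stand-in encoder replace real surrogate CLIP/ViT encoders, finite differences replace autograd, and the multi-crop aggregation step is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, EPS, ETA, LR = 6, 16 / 255, 0.5, 0.02  # toy sizes / assumed hyperparameters
W = rng.normal(size=(DIM, DIM))             # stand-in for a surrogate encoder

def project(delta):
    """Pi: project the perturbation back onto the L-infinity EPS-ball."""
    return np.clip(delta, -EPS, EPS)

def encode(x):
    return np.tanh(x @ W)  # toy feature extractor in place of CLIP/ViT

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def routed_loss(delta, x_src, x_tar, gamma=0.5, alpha=0.1, lam=0.1):
    """Token-routed objective: gated target alignment plus a preservation
    term on non-alignable features (single-token toy version)."""
    z_as, z_tar, z_orig = encode(x_src + delta), encode(x_tar), encode(x_src)
    w = 1.0 / (1.0 + np.exp(-(cosine(z_as, z_tar) - gamma) / alpha))
    return w * cosine(z_as, z_tar) + lam * (1.0 - w) * cosine(z_as, z_orig)

def adapt(delta0, x_src, x_tar, steps=20):
    """Stage 2: PGD ascent on the routed objective (finite differences
    stand in for autograd at this toy scale)."""
    delta = delta0.copy()
    for _ in range(steps):
        grad = np.zeros(DIM)
        for i in range(DIM):
            e = np.zeros(DIM); e[i] = 1e-4
            grad[i] = (routed_loss(delta + e, x_src, x_tar)
                       - routed_loss(delta - e, x_src, x_tar)) / 2e-4
        delta = project(delta + LR * np.sign(grad))  # PGD step + projection
    return delta

# Stage 1: Reptile meta-initialization over sampled target "tasks".
delta0 = np.zeros(DIM)
for _ in range(5):
    tasks = [(rng.normal(size=DIM), rng.normal(size=DIM)) for _ in range(3)]
    delta_bar = np.mean([adapt(delta0, s, t, steps=5) for s, t in tasks], axis=0)
    delta0 = project(delta0 + ETA * (delta_bar - delta0))  # Reptile update

# Stage 2 + execution: adapt to a concrete target, apply to an unseen input.
x_src, x_tar, x_new = (rng.normal(size=DIM) for _ in range(3))
delta = adapt(delta0, x_src, x_tar)
adv_new = x_new + delta  # the universal delta is reused on unseen images
```

The Reptile outer update $\delta_0 \leftarrow \Pi(\delta_0 + \eta(\bar{\delta} - \delta_0))$ appears verbatim in the meta-initialization loop; the projection $\Pi$ keeps both the initialization and the adapted perturbation inside the $L_\infty$ budget throughout.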

Impact:

  • Targeted Misinformation: Attackers can force MLLMs to identify benign images as harmful content, or harmful images as benign objects, across any user input.
  • Scalable Jailbreaking: A single universal perturbation can be distributed to bypass visual safety filters on a mass scale without requiring per-image optimization.
  • Service Degradation: The attack generalizes to unseen images, allowing for consistent manipulation of model outputs in real-world API scenarios.

Affected Systems:

  • GPT-4o (OpenAI)
  • Gemini-2.0 (Google)
  • Claude-3.5 Sonnet (Anthropic)
  • Any MLLM utilizing standard vision-language pre-training alignment (e.g., CLIP, SigLIP) susceptible to transfer attacks.

© 2026 Promptfoo. All rights reserved.