Universal MLLM Target Matching
Research Paper
Make Anything Match Your Target: Universal Adversarial Perturbations against Closed-Source MLLMs via Multi-Crop Routed Meta Optimization
Description: Closed-source Multi-modal Large Language Models (MLLMs) are vulnerable to Universal Targeted Transferable Adversarial Attacks (UTTAA). An attacker can generate a single, image-agnostic adversarial perturbation ($\delta$) that, when added to any arbitrary source image, steers the victim model to output a description or classification matching a specific target image chosen by the attacker. This vulnerability exploits the transferability of adversarial features from open-source surrogate vision encoders (e.g., CLIP, ViT) to proprietary models.
The specific attack vector, termed MCRMO-Attack (Multi-Crop Routed Meta Optimization), overcomes the instability of previous sample-specific attacks through three mechanisms:
- Multi-Crop Aggregation (MCA) with Attention-Guided Crop (AGC): Stabilizes optimization by aggregating gradients over multiple random crops and one attention-salient crop of the target, rather than a single view.
- Token Routing (TR): Uses an alignability-gated mechanism to identify source tokens that structurally match the target. It selectively optimizes these tokens for target alignment while constraining non-alignable tokens to preserve original source semantics, preventing optimization drift.
- Meta-Initialization (MI): Utilizes a Reptile-based meta-learning approach to generate a generalized perturbation initialization ($\delta_0$) that adapts rapidly to new targets with few-shot supervision.
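The Multi-Crop Aggregation step above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `encode` is a toy stand-in for a surrogate vision encoder, and the crop sizes, counts, and the 2-D attention map are assumed for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop(img, size):
    """Take a random square crop from an HxW image."""
    h, w = img.shape
    y = rng.integers(0, h - size + 1)
    x = rng.integers(0, w - size + 1)
    return img[y:y + size, x:x + size]

def attention_guided_crop(img, attn, size):
    """AGC: crop centered on the peak of an attention map."""
    y, x = np.unravel_index(np.argmax(attn), attn.shape)
    y = int(np.clip(y - size // 2, 0, img.shape[0] - size))
    x = int(np.clip(x - size // 2, 0, img.shape[1] - size))
    return img[y:y + size, x:x + size]

def aggregated_target_feature(img, attn, encode, n_random=4, size=8):
    """MCA: average surrogate features over random crops plus one AGC view,
    rather than encoding a single view of the target."""
    views = [random_crop(img, size) for _ in range(n_random)]
    views.append(attention_guided_crop(img, attn, size))
    return np.mean([encode(v) for v in views], axis=0)

# Toy surrogate "encoder": pools each crop into a 2-D feature.
encode = lambda v: np.array([v.mean(), v.std()])

img = rng.random((16, 16))
attn = rng.random((16, 16))
z_tar = aggregated_target_feature(img, attn, encode)
```

In the real attack the averaging happens over per-crop gradients of a deep encoder; averaging the toy features here shows the same stabilizing idea.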
Examples: To reproduce the MCRMO-Attack against a target MLLM (e.g., GPT-4o) using an ensemble of surrogate encoders $\mathcal{F} = \{f_{\theta_i}\}$:
- Preparation: Select a target image $\mathbf{x}_{tar}$ and a pool of $N=20$ source images.
- Meta-Initialization (Stage 1):
- Treat different targets as tasks $\tau$.
- Update a shared initialization $\delta_0$ using the Reptile first-order update rule: $\delta_0 \leftarrow \Pi(\delta_0 + \eta(\bar{\delta} - \delta_0))$, where $\bar{\delta}$ is the average perturbation after inner-loop adaptation on sampled tasks.
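The Reptile outer loop above can be sketched as follows. This is an illustrative toy, assuming a first-order update with an $L_\infty$ projection $\Pi$; the `inner_adapt` closure stands in for the real inner-loop PGD adaptation on a sampled target task.

```python
import numpy as np

def project(delta, eps):
    """Pi: project the perturbation back into the L_inf ball of radius eps."""
    return np.clip(delta, -eps, eps)

def reptile_meta_init(tasks, inner_adapt, eta=0.5, eps=8/255,
                      steps=10, shape=(4,)):
    """First-order Reptile: pull delta_0 toward the average of the
    task-adapted perturbations (delta_bar), then project."""
    delta0 = np.zeros(shape)
    for _ in range(steps):
        adapted = [inner_adapt(delta0, t) for t in tasks]
        delta_bar = np.mean(adapted, axis=0)
        delta0 = project(delta0 + eta * (delta_bar - delta0), eps)
    return delta0

# Toy tasks: each "target" is an ideal perturbation; the inner loop
# nudges delta toward it (a stand-in for per-task PGD adaptation).
targets = [np.full(4, 0.02), np.full(4, -0.01), np.full(4, 0.03)]
inner = lambda d, t: project(d + 0.1 * (t - d), 8/255)
delta0 = reptile_meta_init(targets, inner, shape=(4,))
```

The learned `delta0` then seeds Stage 2, so adaptation to a new target needs far fewer steps than starting from zero.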
- Target Adaptation (Stage 2):
- Initialize $\delta$ with the learned $\delta_0$.
- For each optimization step, sample a batch of crops $\mathcal{V}^+(\mathbf{x}_{tar})$ including random crops and an Attention-Guided Crop (centered on peak attention activation).
- Extract features $z_{tar}$ (target) and $z_{as}$ (adversarial source).
- Compute the Token Routing gate $w(z_{as}) = \sigma((r(z_{as}) - \gamma)/\alpha)$ based on cosine similarity $r$ between source and target tokens.
- Optimize $\delta$ using Projected Gradient Descent (PGD) to maximize the objective: $$ \mathcal{L} = \sum w(z_{as}) \cdot \mathcal{L}_{align}(z_{as}, z_{tar}) + \lambda_{pre} (1-w(z_{as})) \cdot \cos(z_{as}, z_{orig}) $$
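The routing gate and objective in the steps above can be sketched directly from the formulas. This is a hedged numpy sketch, assuming cosine similarity for $\mathcal{L}_{align}$ and toy token features; the real attack backpropagates through surrogate encoders to get the gradient fed to `pgd_step`.

```python
import numpy as np

def cosine(a, b):
    """Token-wise cosine similarity along the feature axis."""
    num = (a * b).sum(-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + 1e-8
    return num / den

def routing_gate(z_as, z_tar, gamma=0.3, alpha=0.1):
    """TR gate: w = sigmoid((r - gamma) / alpha), r = cos(z_as, z_tar)."""
    r = cosine(z_as, z_tar)
    return 1.0 / (1.0 + np.exp(-(r - gamma) / alpha))

def routed_objective(z_as, z_tar, z_orig, lam_pre=1.0):
    """L = sum w * L_align(z_as, z_tar) + lam_pre * (1 - w) * cos(z_as, z_orig)."""
    w = routing_gate(z_as, z_tar)
    align = cosine(z_as, z_tar)      # pull alignable tokens toward the target
    preserve = cosine(z_as, z_orig)  # keep non-alignable tokens near the source
    return (w * align + lam_pre * (1.0 - w) * preserve).sum()

def pgd_step(delta, grad, step=1/255, eps=8/255):
    """One PGD ascent step on delta, projected to the L_inf ball."""
    return np.clip(delta + step * np.sign(grad), -eps, eps)

# Toy tokens: 4 tokens of dimension 8.
rng = np.random.default_rng(0)
z_tar = rng.standard_normal((4, 8))
z_orig = rng.standard_normal((4, 8))
z_as = z_orig + 0.1 * rng.standard_normal((4, 8))
w = routing_gate(z_as, z_tar)
loss = routed_objective(z_as, z_tar, z_orig)
delta = pgd_step(np.zeros(5), rng.standard_normal(5))
```

The gate makes the trade-off per token: tokens with high target similarity are pushed further toward the target, while the rest are anchored to the original source semantics.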
- Attack Execution:
- Add the resulting universal perturbation $\delta$ to a strictly unseen image $\mathbf{x}_{new}$.
- Input $\mathbf{x}_{new} + \delta$ to the victim model.
- Result: The model generates a caption describing $\mathbf{x}_{tar}$ instead of $\mathbf{x}_{new}$.
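The execution step is deliberately simple, which is what makes the attack scalable: the same $\delta$ is reused unchanged across arbitrary unseen images. A minimal sketch (image tensors and shapes are illustrative assumptions):

```python
import numpy as np

def apply_universal(x_new, delta):
    """Add the image-agnostic perturbation and clip to the valid pixel range."""
    return np.clip(x_new + delta, 0.0, 1.0)

# One fixed delta, reused across several unseen images.
rng = np.random.default_rng(1)
delta = rng.uniform(-8/255, 8/255, size=(3, 8, 8))
images = [rng.random((3, 8, 8)) for _ in range(3)]
adv = [apply_universal(x, delta) for x in images]
```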
Impact:
- Targeted Misinformation: Attackers can force MLLMs to identify benign images as harmful content, or harmful images as benign objects, across any user input.
- Scalable Jailbreaking: A single universal perturbation can be distributed to bypass visual safety filters on a mass scale without requiring per-image optimization.
- Service Degradation: The attack generalizes to unseen images, allowing for consistent manipulation of model outputs in real-world API scenarios.
Affected Systems:
- GPT-4o (OpenAI)
- Gemini-2.0 (Google)
- Claude-3.5 Sonnet (Anthropic)
- Any MLLM utilizing standard vision-language pre-training alignment (e.g., CLIP, SigLIP) susceptible to transfer attacks.
© 2026 Promptfoo. All rights reserved.