VLM Imperceptible Image Jailbreak
Research Paper
JaiLIP: Jailbreaking Vision-Language Models via Loss Guided Image Perturbation
Description:
A vulnerability exists in Vision-Language Models (VLMs) that allows safety alignment mechanisms to be bypassed through loss-guided adversarial image perturbations. The attack, known as JaiLIP, operates entirely in the image space and requires no textual prompt manipulation. It is carried out by optimizing an adversarial image under a joint objective that minimizes both the Mean Squared Error (MSE) between the clean and perturbed images and the model's loss for generating specific harmful target sequences, thereby maximizing the likelihood of that output. By using a tanh-space reparameterization to keep pixel values valid, the attack produces imperceptible visual triggers that force the model to output toxic content, threats, or domain-specific harmful advice.
Examples: To reproduce the JaiLIP attack, an attacker performs the following optimization steps on a target VLM (e.g., BLIP-2 or MiniGPT-4):
- Initialize: Select a clean image $x$ and a set of harmful target sentences $\mathcal{T}$ (e.g., "Step-by-step guide to [illegal activity]").
- Transformation: Apply a change of variables to ensure valid pixel space: $w = \tanh^{-1}(2x - 1)$, where the adversarial image is defined as $x_{adv} = \frac{1}{2}(\tanh(w) + 1)$.
- Optimization: Run gradient descent with the Adam optimizer to minimize the following joint loss (see the sketch after this list): $$ \mathcal{L}_{total} = \mathcal{L}_{MSE}(x, x_{adv}) + c \cdot \mathcal{L}_{model}(M(x_{adv}), \mathcal{T}) $$
- $\mathcal{L}_{MSE}$: Ensures visual imperceptibility.
- $\mathcal{L}_{model}$: Cross-entropy loss forcing the model $M$ to generate targets from $\mathcal{T}$.
- Hyperparameters:
- Iterations: 5000
- Learning Rate: $1 \times 10^{-2}$
- Loss Multiplier ($c$): $0.01$ (for BLIP-2) or $1.0$ (for MiniGPT-4).
- Batch Size: 8
- Execution: Feed the resulting $x_{adv}$ into the VLM without a text prompt (or with a benign prompt). The model will output the target harmful content.
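The following is a minimal PyTorch sketch of the optimization loop described above, not the authors' reference implementation. The model-specific piece is abstracted behind a hypothetical `model_target_loss(x_adv, targets)` helper, which is assumed to return the cross-entropy of the VLM generating the target sentences conditioned on the image; the helper name, random batching scheme, and numerical epsilon are assumptions added for illustration.

```python
import random
import torch

def jailip_optimize(x_clean, targets, model_target_loss,
                    c=0.01, lr=1e-2, iters=5000, batch_size=8):
    """Optimize an adversarial image that stays visually close to x_clean (MSE term)
    while lowering the VLM's loss for generating the target sentences."""
    # Change of variables: x_adv = (tanh(w) + 1) / 2 is always a valid pixel value,
    # so w can be updated freely without clipping.
    eps = 1e-6  # keep atanh away from +/-1 at saturated pixels
    w = torch.atanh((2 * x_clean - 1).clamp(-1 + eps, 1 - eps)).detach().requires_grad_(True)
    optimizer = torch.optim.Adam([w], lr=lr)

    for _ in range(iters):
        x_adv = 0.5 * (torch.tanh(w) + 1)
        batch = random.sample(targets, min(batch_size, len(targets)))
        loss_mse = torch.nn.functional.mse_loss(x_adv, x_clean)   # imperceptibility term
        loss_model = model_target_loss(x_adv, batch)              # cross-entropy on target sequences
        loss = loss_mse + c * loss_model                          # joint objective
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return (0.5 * (torch.tanh(w) + 1)).detach()
```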
Impact: Exploitation of this vulnerability allows for the automated generation of severe toxicity, hate speech, profanity, and violent threats. It also enables domain-specific attacks; for example, in transportation contexts, the model can be coerced into providing unsafe advice or incorrect reasoning about traffic scenarios. This effectively nullifies the safety alignment fine-tuning of the language model backbone.
Affected Systems:
- BLIP-2 (specifically with Vicuna-13B backbone)
- MiniGPT-4 (specifically with Vicuna-13B backbone)
- Other Vision-Language Models (VLMs) that utilize frozen vision encoders coupled with Large Language Models.
Mitigation Steps:
- Word-Dropping Strategy: Implement a robust alignment defense in which the system generates multiple versions of the input prompt by randomly removing a small portion of tokens (10–30%) and compares the model's responses across these versions. The resulting consistency check disrupts the specific adversarial patterns required for the jailbreak (a minimal sketch follows this list).
- Self-Verification: Introduce a post-generation verification step in which the model evaluates its own output for toxicity using a standard classifier prompt before the response is shown to the user (also sketched below).
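A minimal sketch of the word-dropping consistency check, assuming a generic `generate(image, prompt)` callable for the deployed VLM. The variant count, drop fraction, token-overlap similarity, and agreement threshold are illustrative assumptions rather than values from the source.

```python
import random

def word_drop_variants(prompt, n_variants=4, drop_frac=0.2):
    """Create prompt variants with roughly 10-30% of tokens randomly removed."""
    tokens = prompt.split()
    variants = []
    for _ in range(n_variants):
        kept = [t for t in tokens if random.random() > drop_frac]
        variants.append(" ".join(kept) if kept else prompt)
    return variants

def consistent_response(generate, image, prompt, agreement_threshold=0.5):
    """Flag the input as suspicious if responses to dropped-token variants diverge
    sharply from the response to the original prompt (crude token-overlap check)."""
    base = generate(image, prompt)
    base_tokens = set(base.lower().split())
    agreements = []
    for variant in word_drop_variants(prompt):
        resp_tokens = set(generate(image, variant).lower().split())
        overlap = len(base_tokens & resp_tokens) / max(len(base_tokens | resp_tokens), 1)
        agreements.append(overlap)
    suspicious = sum(agreements) / len(agreements) < agreement_threshold
    return base, suspicious
```

Token overlap is a deliberately crude agreement measure; an embedding-based semantic similarity model could be substituted if finer-grained consistency scoring is needed.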
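And a minimal sketch of the self-verification step, assuming a text-only `llm(prompt)` callable used as the toxicity judge. The judge prompt and refusal message are assumptions, not text from the source.

```python
# Hypothetical judge prompt asking the model to screen its own candidate output.
VERIFY_PROMPT = (
    "You are a content-safety reviewer. Does the following text contain toxicity, "
    "threats, hate speech, or instructions for harmful activity? Answer YES or NO.\n\n"
    "Text:\n{output}"
)

def self_verify(llm, candidate_output):
    """Return True if the candidate output is judged safe to display."""
    verdict = llm(VERIFY_PROMPT.format(output=candidate_output)).strip().upper()
    return not verdict.startswith("YES")

def safe_respond(llm, candidate_output, refusal="I can't help with that."):
    """Show the response only if self-verification passes; otherwise refuse."""
    return candidate_output if self_verify(llm, candidate_output) else refusal
```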