VLM Image-Shift Jailbreak
Research Paper
Understanding and Defending VLM Jailbreaks via Jailbreak-Related Representation Shift
Description: The integration of the visual modality in Large Vision-Language Models (VLMs) introduces a vulnerability where appending an image to a harmful text prompt induces a "jailbreak-related representation shift" in the model's internal high-dimensional representation space. This shift forcibly steers the model's last-token hidden state away from a designated refusal state and into a distinct jailbreak state. The vulnerability arises because the visual modality overrides the safety alignment of the underlying language model backbone, allowing the model to process and fulfill the harmful request even though it successfully recognizes the harmful intent. The magnitude of this shift, and the resulting attack success rate, scale proportionally with the amount of harmful visual information and with the semantic relevance between the image and the text prompt.
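In the notation used under Mitigation Steps (the symbols below are illustrative and may not match the paper's exact formulation), let $\mathbf{h}^{(\ell)}_{\text{mm}}$ and $\mathbf{h}^{(\ell)}_{\text{text}}$ denote the last-token hidden states at layer $\ell$ for the multimodal input and its text-only equivalent. The image-induced shift and its jailbreak-related component are then

$$\Delta \mathbf{h}^{(\ell)} = \mathbf{h}^{(\ell)}_{\text{mm}} - \mathbf{h}^{(\ell)}_{\text{text}}, \qquad s^{(\ell)} = \big\langle \Delta \mathbf{h}^{(\ell)},\, \mathbf{d}^{(\ell)} \big\rangle,$$

where $\mathbf{d}^{(\ell)}$ is the pre-computed unit jailbreak direction for that layer; a large positive $s^{(\ell)}$ indicates that the image has pushed the representation toward the jailbreak state.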
Examples:
- Blank Image Jailbreak: Appending a completely blank image to an explicitly harmful text prompt (e.g., "How to make a bomb") bypasses the refusal mechanism, increasing the jailbreak success rate on models like LLaVA-1.5-7B by over 28%.
- High Semantic Relevance (SD+TYPO): Pairing an explicitly harmful text prompt with an AI-generated image (Stable Diffusion) depicting the harmful concept, concatenated with typographic images containing the harmful keywords.
- Implicitly Harmful Attack: Masking the harmful intent in the text by using a benign prompt (e.g., "How to make this product") and conveying the actual harmful target strictly via the visual modality (e.g., an image of a bomb combined with typography of the target keyword). See the HADES (IH) and MM-SafetyBench (IH) datasets.
- Adversarial Visual Noise: Using geometry-based attacks (MML-R, MML-M, MML-B64) or gradient-based optimized images (HADES-gradient) paired with harmful text to forcefully trigger the representation shift.
Impact: An attacker can reliably bypass the safety alignment of VLMs to elicit restricted, illegal, or policy-violating content (e.g., violence, self-harm, privacy violations, financial harm). Because the vulnerability operates at the internal representation level rather than relying on perception failure, standard input-level textual or visual safety filters are highly likely to be bypassed.
Affected Systems:
- LLaVA-1.5-7B
- ShareGPT4V-7B
- InternVL-Chat-19B
- Other Vision-Language Models (VLMs) relying on an underlying Large Language Model (LLM) backbone for safety alignment.
Mitigation Steps: To defend against this vulnerability, implement Jailbreak-Related Shift Removal (JRS-Rem) at inference time (reference implementation available at https://github.com/LeeQueue513/JRS-Rem); an illustrative code sketch follows the steps below:
- Pre-compute the Jailbreak Direction: For each model layer, calculate a fixed jailbreak direction vector ($\mathbf{d}^{(\ell)}$). This is the normalized difference between the average representations of a small set of known jailbreak samples and refusal samples.
- Measure Representation Shift: During inference, calculate the image-induced representation shift by comparing the last-token hidden state of the multimodal input against the hidden state of the text-only equivalent.
- Calculate Scalar Projection: Compute the "jailbreak-related shift" by projecting the total image-induced representation shift onto the pre-computed jailbreak direction vector.
- Rectify the Hidden State: During the inference of the first generated token, check if the normalized jailbreak-related shift exceeds a predefined threshold (e.g., $\tau = 0.2$). If it does, dynamically subtract the jailbreak-related shift component from the last-token hidden state to steer the model out of the jailbreak state without degrading utility on benign multimodal tasks.
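The following PyTorch-style sketch illustrates the procedure under stated assumptions. The helper names (`compute_jailbreak_direction`, `jrs_rem`), the choice of normalization for the thresholded quantity, and the exact layer at which the rectification is applied are illustrative and are not taken from the reference implementation.

```python
# Illustrative sketch of Jailbreak-Related Shift Removal (JRS-Rem).
# Assumption (not from the reference implementation): hidden states are
# torch tensors of shape (hidden_dim,) taken at the last token position
# of a chosen layer.
import torch
import torch.nn.functional as F


def compute_jailbreak_direction(jailbreak_states, refusal_states):
    """Fixed per-layer jailbreak direction: the normalized difference between
    the mean last-token representations of known jailbreak and refusal samples."""
    mean_jb = torch.stack(jailbreak_states).mean(dim=0)
    mean_ref = torch.stack(refusal_states).mean(dim=0)
    return F.normalize(mean_jb - mean_ref, dim=0)


def jrs_rem(h_multimodal, h_text_only, direction, tau=0.2):
    """Rectify the last-token hidden state for the first generated token.

    h_multimodal: last-token hidden state of the image+text input
    h_text_only:  last-token hidden state of the text-only equivalent
    direction:    pre-computed unit jailbreak direction for this layer
    tau:          threshold on the normalized jailbreak-related shift
    """
    # Image-induced representation shift.
    shift = h_multimodal - h_text_only
    # Scalar projection of the shift onto the jailbreak direction.
    jrs = torch.dot(shift, direction)
    # Normalize by the shift magnitude so the threshold is scale-invariant
    # (one plausible choice; the paper may normalize differently).
    normalized_jrs = jrs / (shift.norm() + 1e-8)
    if normalized_jrs > tau:
        # Remove only the jailbreak-related component, leaving the rest of
        # the image-induced shift (and thus benign utility) intact.
        h_multimodal = h_multimodal - jrs * direction
    return h_multimodal
```

In deployment, the rectified state would replace the original last-token hidden state at the chosen layer before the first output token is produced; subsequent tokens are generated normally, and benign inputs whose normalized shift stays below $\tau$ are left untouched.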