LMVD-ID: df3003b2
Published February 1, 2026

Semantic Hierarchical VLM Transfer

Affected Models: GPT-4, GPT-4o, Claude 3.5, Gemini 2, InstructBLIP, LLaVA

Research Paper

SGHA-Attack: Semantic-Guided Hierarchical Alignment for Transferable Targeted Attacks on Vision-Language Models


Description: Large Vision-Language Models (VLMs) are vulnerable to a transferable targeted adversarial attack known as SGHA-Attack (Semantic-Guided Hierarchical Alignment). This vulnerability arises from the susceptibility of visual encoders (specifically Vision Transformers) to intermediate-layer feature manipulation optimized on a surrogate model (e.g., CLIP). An attacker can craft adversarial images by injecting imperceptible perturbations that enforce semantic consistency with a target text prompt across multiple hierarchical layers of the visual encoder, rather than solely matching the final embedding. Specifically, the attack utilizes Semantic-Guided Anchor Injection (SGAI) to create a weighted mixture of target references, Hierarchical Visual Structure Alignment (HVSA) to align both global (CLS) and spatial tokens at intermediate depths, and Cross-Modal Latent Space Synchronization (CLSS) to align projected visual features with evolving textual semantics. This allows the adversarial input to bypass cross-modal alignment modules (e.g., Q-Former, linear projection) and effectively steer black-box victim models—including commercial APIs—to generate attacker-specified captions or circumvent safety guardrails.
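To make the hierarchical-alignment idea concrete, the sketch below shows what a multi-depth alignment loss of this kind might look like. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, and the per-layer features stand in for CLS-token activations extracted from a surrogate encoder such as CLIP.

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine similarity between two feature vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def hierarchical_alignment_loss(adv_feats, anchor_feats, layer_weights):
    """Weighted misalignment between adversarial-image features and
    target-anchor features taken at several depths of the surrogate
    visual encoder (in the spirit of HVSA).

    adv_feats / anchor_feats: lists of per-layer feature vectors
    (e.g. CLS tokens at selected intermediate blocks).
    layer_weights: per-layer importance weights.
    The attacker perturbs the input image to *minimize* this loss.
    """
    return sum(w * (1.0 - cosine_sim(a, t))
               for w, a, t in zip(layer_weights, adv_feats, anchor_feats))
```

Matching the target at every selected depth, rather than only at the final embedding, is what the paper argues makes the perturbation survive the victim's cross-modal alignment modules.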

Examples: The vulnerability allows an attacker to force a VLM to describe an image as a completely unrelated object or concept defined by the attacker.

  1. Targeted Semantic Hijacking (UniDiffuser):
  • Clean Input: An image of three white wolves.
  • Target Prompt: "A bathroom with tiled walls."
  • Adversarial Output: The VLM describes the input image as a bathroom with tiled walls, ignoring the visual evidence of the wolves.
  2. Targeted Semantic Hijacking (LLaVA-NeXT):
  • Clean Input: An image of a salamander.
  • Target Prompt: "Laptop computers."
  • Adversarial Output: The VLM identifies the subject of the image as laptop computers.
  3. Commercial Model Manipulation (GPT-4o, Gemini-2.0, Claude-3.5):
  • Clean Input: An image of a monkey.
  • Target Concept: "Baseball player."
  • Adversarial Output: GPT-4o describes the monkey as a "baseball player" preparing to pitch.
  • Reproduction: Code and logic for generating these perturbations are available at: https://github.com/BiiiGerrr/SGHA-Attack

Impact:

  • Semantic Hijacking: Attackers can control the output of the VLM, forcing it to hallucinate specific objects, actions, or contexts that are not present in the image.
  • Safety Guardrail Bypass (Jailbreaking): By aligning the image's intermediate semantics with malicious target text, attackers can coerce models into generating policy-violating content that would normally be refused.
  • Downstream Corruption: Automated decision support systems, content moderation pipelines, and tool-augmented agents relying on these VLMs can be manipulated into executing incorrect actions based on falsified visual interpretations.

Affected Systems:

  • Open-Source VLMs: UniDiffuser, BLIP-2 (ViT-g/14, ViT-L/14), InstructBLIP, MiniGPT-4, LLaVA (v1.5-7b, v1.5-13b), LLaVA-NeXT.
  • Commercial/Closed-Source VLMs: OpenAI GPT-4o, Google Gemini-2.0, Anthropic Claude-3.5.
  • Architecture Class: Any VLM utilizing a frozen or fine-tuned visual encoder (e.g., CLIP ViT-L/14, ViT-G/14, ViT-B/32) susceptible to transfer-based gradient optimization.
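The "transfer-based gradient optimization" referenced above is typically an L∞-bounded iterative procedure on the surrogate. The following is a minimal sketch of a single projected-gradient step under assumed budgets (eps = 8/255, alpha = 1/255 are illustrative values, not the paper's reported settings); obtaining the gradient requires backpropagation through the surrogate encoder, which is not shown.

```python
import numpy as np

def pgd_step(x_adv, grad, x_clean, eps=8/255, alpha=1/255):
    """One L_inf-bounded projected-gradient step on an alignment loss.

    grad: gradient of the loss w.r.t. the image pixels, computed on the
    surrogate model (the victim is never queried during optimization).
    """
    x_adv = x_adv - alpha * np.sign(grad)                   # descend toward target semantics
    x_adv = np.clip(x_adv, x_clean - eps, x_clean + eps)    # stay inside the perturbation budget
    return np.clip(x_adv, 0.0, 1.0)                         # keep a valid image in [0, 1]
```

Because the optimization never touches the victim model, any VLM whose encoder shares feature structure with the surrogate inherits the vulnerability.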

Mitigation Steps: Current standard defenses are largely ineffective against SGHA-Attack due to its multi-granularity alignment strategy. The paper notes the following regarding attempted mitigations:

  • Ineffective Preprocessing: The attack retains a high Attack Success Rate against standard input transformations, including:
    • Bit Reduction (4-bit quantization).
    • JPEG Compression (Quality Factor = 75).
  • Ineffective Purification: Neural purification methods such as ComDefend and DiffPure (diffusion-based purification) fail to remove the adversarial semantics, as the attack embeds the target concept into the deep semantic structure of the features rather than relying on high-frequency noise.
  • Recommended Defense Research: Defenses must be developed that explicitly safeguard internal feature hierarchies against semantic-level hijacking, rather than relying on superficial signal purification.
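For reference, the bit-reduction transform reported as ineffective can be sketched in a few lines (a minimal numpy version; the paper's exact preprocessing pipeline may differ):

```python
import numpy as np

def bit_reduction(img, bits=4):
    """Quantize a uint8 image to `bits` bits per channel.

    A standard feature-squeezing defense; per the paper, it fails to
    remove SGHA-Attack perturbations because the target concept is
    embedded in deep semantic structure, not high-frequency noise.
    """
    step = 256 // (2 ** bits)   # 16 gray levels per step for 4-bit quantization
    return (img // step) * step
```

That such signal-level squeezing leaves the attack intact is the paper's motivation for defenses that inspect internal feature hierarchies instead.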

© 2026 Promptfoo. All rights reserved.