Semantic Hierarchical VLM Transfer
Research Paper
SGHA-Attack: Semantic-Guided Hierarchical Alignment for Transferable Targeted Attacks on Vision-Language Models
Description: Large Vision-Language Models (VLMs) are vulnerable to a transferable targeted adversarial attack known as SGHA-Attack (Semantic-Guided Hierarchical Alignment). This vulnerability arises from the susceptibility of visual encoders (specifically Vision Transformers) to intermediate-layer feature manipulation optimized on a surrogate model (e.g., CLIP). An attacker can craft adversarial images by injecting imperceptible perturbations that enforce semantic consistency with a target text prompt across multiple hierarchical layers of the visual encoder, rather than solely matching the final embedding. Specifically, the attack utilizes Semantic-Guided Anchor Injection (SGAI) to create a weighted mixture of target references, Hierarchical Visual Structure Alignment (HVSA) to align both global (CLS) and spatial tokens at intermediate depths, and Cross-Modal Latent Space Synchronization (CLSS) to align projected visual features with evolving textual semantics. This allows the adversarial input to bypass cross-modal alignment modules (e.g., Q-Former, linear projection) and effectively steer black-box victim models—including commercial APIs—to generate attacker-specified captions or circumvent safety guardrails.
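To make the loss structure concrete, the following PGD-style loop against a CLIP ViT-L/14 surrogate is a minimal, hedged sketch rather than the authors' implementation: it combines a final-embedding term that pulls the image toward the target text (the role CLSS plays) with cosine alignment of intermediate-layer tokens against a single target reference image (a stand-in for the SGAI weighted mixture and the HVSA global/spatial alignment). The layer indices, weighting, and the perturbation budget applied in normalized pixel space are assumptions chosen for brevity.

```python
# Minimal, illustrative sketch only -- NOT the paper's SGAI/HVSA/CLSS implementation.
# Surrogate: CLIP ViT-L/14 via HuggingFace transformers; layer indices, weights, and
# the use of a single target image (instead of a weighted reference mixture) are
# simplifying assumptions.
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def vision_hidden_states(pixel_values):
    """Per-layer token embeddings from the visual encoder (tuple of [B, 1+N, D])."""
    out = model.vision_model(pixel_values=pixel_values, output_hidden_states=True)
    return out.hidden_states

def sgha_like_attack(clean_px, target_px, target_text,
                     eps=16 / 255, alpha=1 / 255, steps=300,
                     layers=(8, 14, 20), lam=1.0):
    # clean_px / target_px: CLIP-preprocessed pixel tensors [1, 3, 224, 224].
    # For brevity the L_inf budget is enforced in this normalized space; a real
    # attack would project the perturbation in raw pixel space.
    txt = processor(text=[target_text], return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        txt_emb = F.normalize(model.get_text_features(**txt), dim=-1)
        tgt_layers = vision_hidden_states(target_px)

    delta = torch.zeros_like(clean_px, requires_grad=True)
    for _ in range(steps):
        adv = clean_px + delta
        adv_layers = vision_hidden_states(adv)

        # Cross-modal term: pull the final projected image embedding toward the target text.
        img_emb = F.normalize(model.get_image_features(pixel_values=adv), dim=-1)
        loss = -(img_emb * txt_emb).sum()

        # Hierarchical term: cosine-align CLS and patch tokens at selected intermediate depths.
        for l in layers:
            a = F.normalize(adv_layers[l], dim=-1)
            t = F.normalize(tgt_layers[l], dim=-1)
            loss = loss - lam * (a * t).sum(dim=-1).mean()

        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()  # descend the negative-similarity loss
            delta.clamp_(-eps, eps)             # keep the perturbation within the budget
            delta.grad = None
    return (clean_px + delta).detach()
```

In the paper's full method, the intermediate-layer targets come from a semantically weighted mixture of reference anchors and the textual anchor is synchronized across depths; the single-image stand-in above only illustrates where the hierarchical alignment is applied.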
Examples: The vulnerability allows an attacker to force a VLM to describe an image as an arbitrary, attacker-chosen object or concept unrelated to its actual content.
- Targeted Semantic Hijacking (UniDiffuser):
- Clean Input: An image of three white wolves.
- Target Prompt: "A bathroom with tiled walls."
- Adversarial Output: The VLM describes the input image as a bathroom with tiled walls, ignoring the visual evidence of the wolves.
- Targeted Semantic Hijacking (LLaVA-NeXT):
- Clean Input: An image of a salamander.
- Target Prompt: "Laptop computers."
- Adversarial Output: The VLM identifies the subject of the image as laptop computers.
- Commercial Model Manipulation (GPT-4o, Gemini-2.0, Claude-3.5):
- Clean Input: An image of a monkey.
- Target Concept: "Baseball player."
- Adversarial Output: GPT-4o describes the monkey as a "baseball player" preparing to pitch.
- Reproduction: Code and logic for generating these perturbations are available at: https://github.com/BiiiGerrr/SGHA-Attack
Impact:
- Semantic Hijacking: Attackers can control the output of the VLM, forcing it to hallucinate specific objects, actions, or contexts that are not present in the image.
- Safety Guardrail Bypass (Jailbreaking): By aligning the image's intermediate semantics with malicious target text, attackers can coerce models into generating policy-violating content that would normally be refused.
- Downstream Corruption: Automated decision support systems, content moderation pipelines, and tool-augmented agents relying on these VLMs can be manipulated into executing incorrect actions based on falsified visual interpretations.
Affected Systems:
- Open-Source VLMs: UniDiffuser, BLIP-2 (ViT-g/14, ViT-L/14), InstructBLIP, MiniGPT-4, LLaVA (v1.5-7b, v1.5-13b), LLaVA-NeXT.
- Commercial/Closed-Source VLMs: OpenAI GPT-4o, Google Gemini-2.0, Anthropic Claude-3.5.
- Architecture Class: Any VLM utilizing a frozen or fine-tuned visual encoder (e.g., CLIP ViT-L/14, ViT-G/14, ViT-B/32) susceptible to transfer-based gradient optimization.
Mitigation Steps: Current standard defenses are largely ineffective against SGHA-Attack due to its multi-granularity alignment strategy. The paper notes the following regarding attempted mitigations:
- Ineffective Preprocessing: The attack remains robust (high Attack Success Rate) against standard input transformations (see the sketch after this list), including:
- Bit Reduction (4-bit quantization).
- JPEG Compression (Quality Factor=75).
- Ineffective Purification: Neural purification methods such as ComDefend and DiffPure (diffusion-based purification) fail to remove the adversarial semantics, as the attack embeds the target concept into the deep semantic structure of the features rather than relying on high-frequency noise.
- Recommended Defense Research: Defenses must be developed that explicitly safeguard internal feature hierarchies against semantic-level hijacking, rather than relying on superficial signal purification.
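For reference, the two preprocessing transforms reported as ineffective can be reproduced with a few lines of Pillow/NumPy. The function names below are ours, not from the paper's repository; this is a sketch of the defenses being evaluated, not a recommended fix.

```python
# Illustrative sketch of the two preprocessing defenses reported as ineffective:
# bit-depth reduction to 4 bits and JPEG re-compression at quality 75.
import io
import numpy as np
from PIL import Image

def bit_depth_reduce(img: Image.Image, bits: int = 4) -> Image.Image:
    # Quantize each channel to 2**bits levels, then map back to the 8-bit range.
    arr = np.asarray(img.convert("RGB"), dtype=np.uint8)
    step = 256 // (2 ** bits)
    quantized = (arr // step) * step
    return Image.fromarray(quantized)

def jpeg_compress(img: Image.Image, quality: int = 75) -> Image.Image:
    # Round-trip through an in-memory JPEG encode/decode.
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).copy()

# Usage: apply either (or both) transforms to the adversarial image before it
# reaches the victim VLM, e.g. defended = jpeg_compress(bit_depth_reduce(adv_image)).
```

Per the paper, the targeted semantics largely survive both transforms, which motivates the recommendation above: defend the internal feature hierarchy rather than relying on signal-level purification.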