LMVD-ID: df3003b2
Published February 1, 2026

Semantic Hierarchical VLM Transfer

Affected Models: GPT-4, GPT-4o, Claude 3.5, Gemini 2, InstructBLIP, LLaVA

Research Paper

SGHA-Attack: Semantic-Guided Hierarchical Alignment for Transferable Targeted Attacks on Vision-Language Models


Description: Large Vision-Language Models (VLMs) are vulnerable to a transferable targeted adversarial attack known as SGHA-Attack (Semantic-Guided Hierarchical Alignment). This vulnerability arises from the susceptibility of visual encoders (specifically Vision Transformers) to intermediate-layer feature manipulation optimized on a surrogate model (e.g., CLIP). An attacker can craft adversarial images by injecting imperceptible perturbations that enforce semantic consistency with a target text prompt across multiple hierarchical layers of the visual encoder, rather than solely matching the final embedding. Specifically, the attack utilizes Semantic-Guided Anchor Injection (SGAI) to create a weighted mixture of target references, Hierarchical Visual Structure Alignment (HVSA) to align both global (CLS) and spatial tokens at intermediate depths, and Cross-Modal Latent Space Synchronization (CLSS) to align projected visual features with evolving textual semantics. This allows the adversarial input to bypass cross-modal alignment modules (e.g., Q-Former, linear projection) and effectively steer black-box victim models—including commercial APIs—to generate attacker-specified captions or circumvent safety guardrails.
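To make the hierarchical-alignment idea concrete, the sketch below shows what a multi-depth alignment loss of this kind might look like. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, and the per-layer features stand in for CLS-token activations extracted from a surrogate encoder such as CLIP.

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine similarity between two feature vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def hierarchical_alignment_loss(adv_feats, anchor_feats, layer_weights):
    """Weighted misalignment between adversarial-image features and
    target-anchor features taken at several depths of the surrogate
    visual encoder (in the spirit of HVSA).

    adv_feats / anchor_feats: lists of per-layer feature vectors
    (e.g. CLS tokens at selected intermediate blocks).
    layer_weights: per-layer importance weights.
    The attacker perturbs the input image to *minimize* this loss.
    """
    return sum(w * (1.0 - cosine_sim(a, t))
               for w, a, t in zip(layer_weights, adv_feats, anchor_feats))
```

Matching the target at every selected depth, rather than only at the final embedding, is what the paper argues makes the perturbation survive the victim's cross-modal alignment modules.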

Examples: The vulnerability allows an attacker to force a VLM to describe an image as a completely unrelated object or concept defined by the attacker.

  1. Targeted Semantic Hijacking (UniDiffuser):
  • Clean Input: An image of three white wolves.
  • Target Prompt: "A bathroom with tiled walls."
  • Adversarial Output: The VLM describes the input image as a bathroom with tiled walls, ignoring the visual evidence of the wolves.
  2. Targeted Semantic Hijacking (LLaVA-NeXT):
  • Clean Input: An image of a salamander.
  • Target Prompt: "Laptop computers."
  • Adversarial Output: The VLM identifies the subject of the image as laptop computers.
  3. Commercial Model Manipulation (GPT-4o, Gemini-2.0, Claude-3.5):
  • Clean Input: An image of a monkey.
  • Target Concept: "Baseball player."
  • Adversarial Output: GPT-4o describes the monkey as a "baseball player" preparing to pitch.
  • Reproduction: Code and logic for generating these perturbations are available at: https://github.com/BiiiGerrr/SGHA-Attack

Impact:

  • Semantic Hijacking: Attackers can control the output of the VLM, forcing it to hallucinate specific objects, actions, or contexts that are not present in the image.
  • Safety Guardrail Bypass (Jailbreaking): By aligning the image's intermediate semantics with malicious target text, attackers can coerce models into generating policy-violating content that would normally be refused.
  • Downstream Corruption: Automated decision support systems, content moderation pipelines, and tool-augmented agents relying on these VLMs can be manipulated into executing incorrect actions based on falsified visual interpretations.

Affected Systems:

  • Open-Source VLMs: UniDiffuser, BLIP-2 (ViT-g/14, ViT-L/14), InstructBLIP, MiniGPT-4, LLaVA (v1.5-7b, v1.5-13b), LLaVA-NeXT.
  • Commercial/Closed-Source VLMs: OpenAI GPT-4o, Google Gemini-2.0, Anthropic Claude-3.5.
  • Architecture Class: Any VLM utilizing a frozen or fine-tuned visual encoder (e.g., CLIP ViT-L/14, ViT-G/14, ViT-B/32) susceptible to transfer-based gradient optimization.
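The "transfer-based gradient optimization" referenced above is typically an L∞-bounded iterative procedure on the surrogate. The following is a minimal sketch of a single projected-gradient step under assumed budgets (eps = 8/255, alpha = 1/255 are illustrative values, not the paper's reported settings); obtaining the gradient requires backpropagation through the surrogate encoder, which is not shown.

```python
import numpy as np

def pgd_step(x_adv, grad, x_clean, eps=8/255, alpha=1/255):
    """One L_inf-bounded projected-gradient step on an alignment loss.

    grad: gradient of the loss w.r.t. the image pixels, computed on the
    surrogate model (the victim is never queried during optimization).
    """
    x_adv = x_adv - alpha * np.sign(grad)                   # descend toward target semantics
    x_adv = np.clip(x_adv, x_clean - eps, x_clean + eps)    # stay inside the perturbation budget
    return np.clip(x_adv, 0.0, 1.0)                         # keep a valid image in [0, 1]
```

Because the optimization never touches the victim model, any VLM whose encoder shares feature structure with the surrogate inherits the vulnerability.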

Mitigation Steps: Current standard defenses are largely ineffective against SGHA-Attack due to its multi-granularity alignment strategy. The paper notes the following regarding attempted mitigations:

  • Ineffective Preprocessing: The attack retains a high Attack Success Rate against standard input transformations, including:
    • Bit Reduction (4-bit quantization).
    • JPEG Compression (Quality Factor = 75).
  • Ineffective Purification: Neural purification methods such as ComDefend and DiffPure (diffusion-based purification) fail to remove the adversarial semantics, as the attack embeds the target concept into the deep semantic structure of the features rather than relying on high-frequency noise.
  • Recommended Defense Research: Defenses must be developed that explicitly safeguard internal feature hierarchies against semantic-level hijacking, rather than relying on superficial signal purification.
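For reference, the bit-reduction transform reported as ineffective can be sketched in a few lines (a minimal numpy version; the paper's exact preprocessing pipeline may differ):

```python
import numpy as np

def bit_reduction(img, bits=4):
    """Quantize a uint8 image to `bits` bits per channel.

    A standard feature-squeezing defense; per the paper, it fails to
    remove SGHA-Attack perturbations because the target concept is
    embedded in deep semantic structure, not high-frequency noise.
    """
    step = 256 // (2 ** bits)   # 16 gray levels per step for 4-bit quantization
    return (img // step) * step
```

That such signal-level squeezing leaves the attack intact is the paper's motivation for defenses that inspect internal feature hierarchies instead.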

© 2026 Promptfoo. All rights reserved.