Adversarial ROI Privacy Attack
Research Paper
VIP: Visual Information Protection through Adversarial Attacks on Vision-Language Models
Description: Vision-Language Models (VLMs) utilizing Transformer-based visual encoders (specifically CLIP and EVA-CLIP variants) are vulnerable to a targeted adversarial attack dubbed "VIP" (Visual Information Protection). This vulnerability allows an attacker to manipulate the model's internal attention mechanism to create a "blind spot" within a specific Region of Interest (ROI) of an input image. By optimizing an additive image perturbation ($\delta$), the attack minimizes the attention weights and value matrix norms assigned to image patches within the target ROI in the early Multi-Head Attention (MHA) blocks of the visual encoder. This effectively prevents the propagation of semantic information from the targeted region to subsequent layers. Consequently, the VLM fails to detect, describe, or answer questions regarding objects located within the ROI (achieving up to 98% reduction in detection accuracy), while the visual quality and semantic interpretation of the surrounding non-targeted regions remain intact.
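For reference, the sketch below shows one way to map a pixel-space ROI to the set of visual-encoder token indices ($\mathcal{S}_{ROI}$) referenced in the objective that follows. The 224x224 input resolution, 14-pixel patch size, and CLS-token offset are assumptions matching CLIP ViT-L/14, not values taken from the paper.

```python
import torch

def roi_to_patch_indices(roi_xyxy, image_size=224, patch_size=14, has_cls_token=True):
    """Map an ROI bounding box (x1, y1, x2, y2), in pixel coordinates, to the
    indices of the ViT tokens whose patches overlap the ROI (the set S_ROI).

    Assumes a square input and a CLS token prepended at index 0, as in
    CLIP ViT-L/14; other encoders (e.g., EVA-CLIP) use different grids.
    """
    grid = image_size // patch_size              # e.g. 224 // 14 = 16 patches per side
    x1, y1, x2, y2 = roi_xyxy

    # Convert pixel coordinates to inclusive patch-grid coordinates.
    c1, c2 = x1 // patch_size, min((x2 - 1) // patch_size, grid - 1)
    r1, r2 = y1 // patch_size, min((y2 - 1) // patch_size, grid - 1)

    indices = []
    for r in range(r1, r2 + 1):
        for c in range(c1, c2 + 1):
            idx = r * grid + c
            indices.append(idx + 1 if has_cls_token else idx)   # offset for CLS token
    return torch.tensor(indices, dtype=torch.long)
```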
Examples: The attack is implemented by optimizing the perturbation $\delta$ via the following loss function, targeting the first $L_{max}$ attention blocks of the visual encoder:
$$ \delta = \underset{\delta}{\mathrm{argmin}} \sum_{l=1}^{L_{max}} \sum_{h=1}^{H} \left( \sum_{i=1}^{|\mathcal{S}_{I}|} \sum_{j \in \mathcal{S}_{ROI}} \mathcal{A}^{(l,h)}(i,j) + \lambda_{v} \sum_{j \in \mathcal{S}_{ROI}} \sum_{k=1}^{d_{v}} \lVert V^{(l,h)}(j,k) \rVert_{2} \right) $$
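The following is a minimal PyTorch sketch of this optimization, assuming access to per-block attention and value tensors (e.g., via forward hooks on the visual encoder). The helpers `vip_loss` and `encode_with_hooks`, and the PGD-style hyperparameters (`steps`, `alpha`, `eps`), are illustrative placeholders rather than the authors' implementation, which lives in the repository linked below.

```python
import torch

def vip_loss(attn_maps, value_mats, roi_idx, lambda_v=1.0):
    """Attention mass received by ROI tokens plus the weighted L2 norm of their
    value vectors, accumulated over the targeted blocks and heads (mirrors the
    objective above)."""
    loss = torch.zeros((), device=attn_maps[0].device)
    for A, V in zip(attn_maps, value_mats):
        # A: [H, N, N] post-softmax attention; columns roi_idx are keys in S_ROI.
        loss = loss + A[:, :, roi_idx].sum()
        # V: [H, N, d_v] value matrix; penalize the L2 norm of the ROI rows.
        loss = loss + lambda_v * V[:, roi_idx, :].norm(p=2, dim=-1).sum()
    return loss

def optimize_delta(image, encode_with_hooks, roi_idx,
                   steps=200, alpha=1 / 255, eps=8 / 255, lambda_v=1.0):
    """PGD-style minimization of the objective under an L_inf budget.
    `encode_with_hooks(x)` is a placeholder that runs the visual encoder on x
    and returns the attention and value tensors of the first L_max blocks."""
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        attn_maps, value_mats = encode_with_hooks(image + delta)
        loss = vip_loss(attn_maps, value_mats, roi_idx, lambda_v)
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()                   # descend on the loss
            delta.clamp_(-eps, eps)                              # L_inf perturbation budget
            delta.copy_((image + delta).clamp(0, 1) - image)     # keep pixels in valid range
        delta.grad.zero_()
    return delta.detach()
```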
Proof-of-concept code and reproduction steps are available in the author's repository:
- Repository: https://github.com/hbrachemi/Vlm_defense-attack
Attack Scenarios:
- Object Concealment in LLaVA: An attacker takes an image of a "tench" (a fish) and uses the VIP attack to mask the region containing the fish. When the VLM is prompted "Is there any tench in the image?", it responds negatively, even though the fish remains clearly visible to a human observer (see the query sketch after this list).
- Surveillance Evasion: An adversary selectively masks specific attributes or objects (e.g., a weapon or a specific person's face) to bypass automated VLM-based content moderation or surveillance systems, while the rest of the scene remains describable by the model to avoid suspicion.
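A hedged sketch of how the first scenario could be checked end-to-end with an off-the-shelf LLaVA-1.5 checkpoint from Hugging Face: the model ID and prompt template follow the public llava-1.5-7b-hf model card, while the file names are assumptions, with the perturbed image presumed to come from the optimization sketched above.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "USER: <image>\nIs there any tench in the image? ASSISTANT:"

# "tench_clean.png" / "tench_adv.png" are hypothetical file names for the
# original image and the image with the optimized perturbation added.
for path in ["tench_clean.png", "tench_adv.png"]:
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(
        model.device, torch.float16
    )
    out = model.generate(**inputs, max_new_tokens=30)
    print(path, processor.decode(out[0], skip_special_tokens=True))
```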
Impact:
- Content Moderation Bypass: Attackers can upload images containing prohibited content (e.g., weapons, contraband, nudity) that VLMs fail to flag because the specific ROI is effectively invisible to the model's attention mechanism.
- Safety Guardrail Evasion: Malicious actors can bypass safety filters that rely on visual analysis to prevent the generation of harmful content related to specific visual triggers.
- Automated Analysis Failure: Systems relying on VLMs for automated inventory, surveillance, or medical imaging analysis can be deceived into missing critical objects within the frame.
Affected Systems:
- LLaVA 1.5 (utilizing CLIP visual encoder)
- InstructBLIP (utilizing EVA-CLIP visual encoder)
- BLIP-2 (utilizing EVA-CLIP visual encoder)
- Any VLM architecture relying on standard Vision Transformer (ViT) attention mechanisms in its visual encoder.