LMVD-ID: 16766413
Published April 1, 2026

Physical Infrared Semantic Disruption

Affected Models: InstructBLIP, LLaVA

Research Paper

Revealing Physical-World Semantic Vulnerabilities: Universal Adversarial Patches for Infrared Vision-Language Models


Description: A vulnerability in Infrared Vision-Language Models (IR-VLMs) allows attackers to systematically degrade open-ended semantic understanding, compromising classification, captioning, and Visual Question Answering (VQA), via a physically deployable Universal Curved-Grid Patch (UCGP). Rather than manipulating explicit text labels, the attack disrupts the clean-category manifold in the model's visual representation space by maximizing the energy of feature deviations orthogonal to the principal subspace and by forcing topological misalignment in the local neighborhood graph. The resulting perturbation requires no per-sample optimization, transfers across models and datasets, and remains robust to physical distortions, which are modeled during optimization via EOT (Expectation Over Transformation) and TPS (Thin Plate Spline) warping.
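The subspace-departure objective described above can be sketched numerically. The function below is a hypothetical illustration (the function name and the `k_components` choice are assumptions, not taken from the paper): it measures how much a patched image's visual feature deviates orthogonally from the principal subspace spanned by clean features.

```python
import numpy as np

def subspace_departure_energy(clean_feats, adv_feat, k_components=8):
    """Energy of an adversarial feature outside the clean principal subspace.

    clean_feats: (N, D) array of clean visual features.
    adv_feat:    (D,) feature of the patched image.
    Hypothetical sketch of the 'orthogonal deviation energy' idea; the
    paper's exact loss terms are not reproduced here.
    """
    mu = clean_feats.mean(axis=0)
    X = clean_feats - mu
    # Principal subspace via SVD of the centred clean features.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    P = Vt[:k_components]              # (k, D) orthonormal basis rows
    v = adv_feat - mu
    v_par = P.T @ (P @ v)              # component inside the principal subspace
    v_orth = v - v_par                 # orthogonal deviation component
    return float(np.dot(v_orth, v_orth))
```

Maximizing this quantity over the patch pushes patched features off the clean-category manifold, which is the effect the description attributes to the UCGP.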

Examples: To reproduce the physical attack, a low-frequency patch is generated using Curved-Grid Mesh (CGM) parameterization with a grid dimension of G=5, a region ratio of one-quarter of the target height, and a curvature-control parameter of δ=0.40 to prevent self-intersections. In the physical world, this optimized pattern is instantiated using temperature-controlled "cold patches" configured to maintain 24°C for sustained thermal contrast. When this patch is attached to a target region (e.g., a pedestrian's clothing) and captured via an infrared camera (e.g., InfiRay XL19V2), the queried IR-VLM generates systematic semantic deviations, such as misclassifying the subject or generating completely unrelated captions and VQA answers regarding the subject's attributes and actions.
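The curved-grid construction above can be sketched as follows. This is a simplified, hypothetical rendering of the CGM parameterization over a unit-square patch region: the per-point offset bound and boundary pinning are assumptions chosen to guarantee the no-self-intersection property that the δ parameter controls, not the authors' exact construction.

```python
import numpy as np

def curved_grid_control_points(G=5, delta=0.40, rng=None):
    """Build a G x G curved-grid mesh over the unit square.

    Control points start on a regular lattice and receive random offsets
    bounded by delta * cell / 2. Because delta < 1, neighbouring points
    stay strictly ordered, so the warped grid cannot self-intersect.
    Hypothetical sketch, not the paper's code.
    """
    rng = np.random.default_rng() if rng is None else rng
    cell = 1.0 / (G - 1)
    xs, ys = np.meshgrid(np.linspace(0, 1, G), np.linspace(0, 1, G))
    grid = np.stack([xs, ys], axis=-1)                     # (G, G, 2)
    offsets = rng.uniform(-1, 1, size=grid.shape) * (delta * cell / 2)
    warped = grid + offsets
    # Pin the outer boundary so the patch stays inside its target region.
    warped[0, :, 1] = 0.0
    warped[-1, :, 1] = 1.0
    warped[:, 0, 0] = 0.0
    warped[:, -1, 0] = 1.0
    return warped
```

With G=5 and δ=0.40 as in the example, adjacent control points remain at least 0.6 of a cell apart, which is the self-intersection guarantee the curvature-control parameter provides.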

Impact: Attackers can reliably manipulate downstream VLM reasoning and evade infrared detection in real-world, low-visibility environments. This allows adversaries to bypass automated surveillance, deceive autonomous navigation systems, and corrupt multimodal perception pipelines by forcing the model to misinterpret specific physical entities (e.g., persons, vehicles, animals) and their contextual semantics.

Affected Systems: Infrared-adapted Vision-Language Models (IR-VLMs) and systems relying on the following visual backbones and generative interfaces:

  • OpenAI CLIP, OpenCLIP, Meta-CLIP, EVA-CLIP
  • LLaVA-1.5, LLaVA-1.6
  • OpenFlamingo
  • BLIP-2, InstructBLIP

Mitigation Steps:

  • Patch-Aware Image Restoration: Deploy guided Digital Watermarking (DW) or image-repair modules that use patch-location priors to reconstruct the clean content beneath localized infrared perturbations before the image is fed into the IR-VLM.
  • Adaptive Adversarial Training (AT): Incorporate physically modeled, continuous structural perturbations (rather than standard digital noise) into the model's training pipeline to harden the visual representation space against targeted subspace departure attacks.

© 2026 Promptfoo. All rights reserved.