LMVD-ID: 0c483a23
Published September 1, 2025

Physical Agent Object Misdirection

Affected Models: GPT-4o, Claude 3.5, Gemini 2.0, LLaVA

Research Paper

ADVEDM: Fine-grained Adversarial Attack against VLM-based Embodied Agents


Description: AdvEDM reveals a vulnerability in Vision-Language Model (VLM) based Embodied Decision-Making (EDM) systems, such as those used in autonomous driving and robotic manipulation. The vulnerability allows an attacker to launch fine-grained adversarial attacks that selectively modify the perception of specific objects in an input image—either by removing them (Semantic Removal) or adding them (Semantic Addition)—while preserving the semantic integrity of the rest of the scene.

Unlike traditional adversarial attacks that disrupt the entire image, causing the model to output gibberish or error messages, AdvEDM uses an attention-patch fixation mechanism to preserve the model's internal Chain-of-Thought (CoT) reasoning. This causes the VLM to generate "valid yet incorrect" decisions: the system constructs a grammatically and logically sound plan based on a falsified perception (e.g., accelerating because it successfully "reasoned" that the road is clear, despite a pedestrian being present in the original image). The attack targets the vision-text encoder (e.g., CLIP, EVA-CLIP) within the VLM.
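To make the attention-patch fixation idea concrete, the following minimal PyTorch-style sketch shows one way a fixation penalty could be computed: attention weights over patches that do not belong to the targeted object are held close to their clean-image values, so the rest of the scene keeps its original semantics. The encoder interface, the mask convention, and the MSE form of the penalty are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def attention_fixation_loss(attn_orig: torch.Tensor,
                            attn_adv: torch.Tensor,
                            target_mask: torch.Tensor) -> torch.Tensor:
    """
    Keep non-target attention weights close to their clean-image values.

    attn_orig, attn_adv : attention weights over image patches from the VLM's
                          vision encoder, shape [..., num_patches], for the
                          clean and perturbed image respectively.
    target_mask         : boolean tensor over patches, True where the attacked
                          object lies (these patches are allowed to change;
                          everything else is "locked").
    """
    keep = ~target_mask                          # background / non-target patches
    return F.mse_loss(attn_adv[..., keep], attn_orig[..., keep].detach())
```

In this sketch the penalty is zero when the background attention is untouched, which is what lets the model keep producing a coherent plan for the unmodified parts of the scene.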

Examples: The attack creates adversarial perturbations using two distinct methods: AdvEDM-R (Removal) and AdvEDM-A (Addition).

  1. Autonomous Driving (Semantic Removal):
  • Input: An image of a road intersection with a pedestrian crossing.
  • Attack: The attacker optimizes the image perturbation to push the target object's (pedestrian) embedding away from its text embedding while locking the attention weights of the background (traffic lights, road lanes).
  • Result: The VLM outputs: "In the current scenario, the traffic light is green... With no vehicles or pedestrians ahead in the lane, it can accelerate safely." The vehicle accelerates into the pedestrian.
  2. Robotic Manipulation (Semantic Addition):
  • Input: A desktop with a toy car and a CD.
  • Attack: The attacker injects the semantics of a "cup" into a background region by utilizing a reference image of a cup and reallocating attention weights in the target area.
  • Result: The VLM outputs: "Detected a CD... Seeing the cup as the target container, decide to pick up the toy car and place it into the cup." The robot attempts to interact with a non-existent object.
  3. Reproduction Resources:
  • See the project website for visualization and demos: https://advedm.github.io/
  • The attack minimizes the similarity between the target object's text embedding $E_t(T_{tar})$ and the adversarial image's patch embeddings $E_{[patch]}(I')$ while constraining the perturbation norm to $\epsilon$: $$\min_{I'} \; w_1 \cdot \mathcal{L}_{cls} + w_2 \cdot \mathcal{L}_{p} + w_3 \cdot \mathcal{L}_{fix}$$ where $\mathcal{L}_{fix}$ ensures the attention weights of non-target regions remain consistent with the original image.
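The sketch below shows how an objective of this form could be optimized with a PGD-style loop against a CLIP-like vision-text encoder for the removal case (AdvEDM-R). The helper methods `encode_patches` and `patch_attention`, the cosine-similarity form of $\mathcal{L}_{cls}$, the $L_2$ surrogate for $\mathcal{L}_{p}$, and the loss weights are assumptions for illustration, not the paper's exact implementation; the addition variant (AdvEDM-A) would instead maximize similarity between the patches of the injected region and a reference object's embedding.

```python
import torch
import torch.nn.functional as F

def advedm_r_attack(encoder, image, target_text_emb, target_mask,
                    eps=8 / 255, alpha=1 / 255, steps=200,
                    w1=1.0, w2=0.1, w3=1.0):
    """
    Sketch of an AdvEDM-R (semantic removal) style optimization.

    encoder         : CLIP-like vision encoder exposing patch embeddings and
                      patch attention weights (assumed interface).
    image           : clean input image in [0, 1], shape [1, 3, H, W].
    target_text_emb : text embedding E_t(T_tar) of the object to "remove", shape [d].
    target_mask     : boolean mask over patch indices covering that object.
    """
    delta = torch.zeros_like(image, requires_grad=True)

    with torch.no_grad():
        attn_orig = encoder.patch_attention(image)          # clean attention (assumed helper)

    for _ in range(steps):
        adv = (image + delta).clamp(0, 1)
        patch_emb = encoder.encode_patches(adv)             # [num_patches, d] (assumed helper)
        attn_adv = encoder.patch_attention(adv)

        # L_cls: push the target object's patch embeddings away from its text embedding
        tgt_patches = patch_emb[target_mask]
        l_cls = F.cosine_similarity(tgt_patches,
                                    target_text_emb.expand_as(tgt_patches), dim=-1).mean()

        # L_p: keep the perturbation small (L2 surrogate for the eps-ball constraint)
        l_p = delta.pow(2).mean()

        # L_fix: lock attention of non-target (background) patches to their clean values
        keep = ~target_mask
        l_fix = F.mse_loss(attn_adv[..., keep], attn_orig[..., keep])

        loss = w1 * l_cls + w2 * l_p + w3 * l_fix
        loss.backward()

        with torch.no_grad():
            delta -= alpha * delta.grad.sign()               # descend on the combined loss
            delta.clamp_(-eps, eps)                          # project back into the eps-ball
            delta.grad.zero_()

    return (image + delta).clamp(0, 1).detach()
```

Because only the target object's patches are pushed away from $E_t(T_{tar})$ while the fixation term pins the background, the downstream VLM still produces a fluent, internally consistent plan, just one grounded in a scene that no longer "contains" the pedestrian.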

Impact:

  • Physical Safety Risks: In autonomous driving, this vulnerability can lead to collisions by masking obstacles (pedestrians, vehicles) or creating phantom obstacles (causing sudden stops). In robotics, it can cause physical damage by manipulating the wrong objects.
  • Decision Integrity Compromise: The attack bypasses standard validity checks because the output text is structurally consistent and logical, making detection via output analysis difficult.

Affected Systems:

  • VLM-based Embodied Agents: Systems using VLMs for decision-making (e.g., autonomous vehicles, robotic arms).
  • Specific Models Tested:
    • BLIP-2
    • MiniGPT-4
    • LLaVA-v2
    • Otter-Image
    • OpenFlamingo-v2
    • Dolphins (Autonomous Driving specific)
    • EmbodiedGPT (Robotic Manipulation specific)
  • Commercial Models (Transferability): The attack demonstrates transferability to black-box models including GPT-4o, Gemini-2.0, and Claude 3.5.
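Transferability of this kind is typically checked by crafting the perturbation against a local white-box surrogate encoder and then sending the perturbed frame to a black-box model's API. The sketch below uses the OpenAI Python SDK as one example endpoint; the prompt, file names, and the simple keyword check are illustrative assumptions, not the paper's evaluation protocol.

```python
import base64
from openai import OpenAI  # pip install openai

def query_black_box_vlm(image_path: str, prompt: str, model: str = "gpt-4o") -> str:
    """Send a (possibly adversarial) driving frame to a black-box VLM and return its plan."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

# Crude transfer check: does the plan for the perturbed frame still mention the pedestrian?
prompt = ("You are the planner of an autonomous vehicle. "
          "Describe the scene and decide the next action.")
clean_plan = query_black_box_vlm("intersection_clean.png", prompt)       # hypothetical file
adv_plan = query_black_box_vlm("intersection_advedm_r.png", prompt)      # hypothetical file
print("pedestrian mentioned (clean):", "pedestrian" in clean_plan.lower())
print("pedestrian mentioned (adv):  ", "pedestrian" in adv_plan.lower())
```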
