Point Cloud Adversarial Attack
Research Paper
On the Adversarial Robustness of 3D Large Vision-Language Models
Description: Point-based 3D Vision-Language Models (VLMs), specifically PointLLM and GPT4Point, are vulnerable to white-box, gradient-based adversarial attacks. The vulnerability lies in the models' processing of 3D point cloud data: an attacker can optimize imperceptible geometric perturbations ($\delta$) on the input point cloud ($x$) to manipulate the model's textual output. The paper identifies two attack vectors:
- Vision Attack: Directly perturbs the high-dimensional visual token features generated by the 3D encoder (Point-BERT) and projector, disrupting the alignment between visual and textual representations.
- Caption Attack: Manipulates the end-to-end generation process to force specific token sequences.
These attacks can be targeted (forcing the model to output a specific, potentially harmful caption) or untargeted (degrading the semantic fidelity of the output). The vulnerability exploits the irregular and non-smooth nature of the 3D latent space and the lack of robustness in the vision-text alignment. Standard 3D defenses (SRS, SOR, DUP-Net) are ineffective against these perturbations.
Examples: The vulnerability is reproduced by optimizing a perturbation $\delta$ via gradient descent to minimize an adversarial loss $\mathcal{L}_{\text{adv}}$ while maintaining geometric realism via a distance loss $\mathcal{L}_{\text{dis}}$.
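The distance loss combines Kernel, Hide, and Chamfer terms; the Chamfer component can be sketched with a standard symmetric nearest-neighbor formulation (the exact weighting against the other two terms is paper-specific and not reproduced here):

```python
import numpy as np

def chamfer_distance(x: np.ndarray, x_adv: np.ndarray) -> float:
    """Symmetric Chamfer distance between two point clouds of shape (N, 3)
    and (M, 3). A common choice for the Chamfer term inside L_dis."""
    # Pairwise squared Euclidean distances, shape (N, M)
    d2 = np.sum((x[:, None, :] - x_adv[None, :, :]) ** 2, axis=-1)
    # Average nearest-neighbor distance in both directions
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())
```

Identical clouds yield zero, and the value grows smoothly with the perturbation magnitude, which is what makes it usable as a differentiable imperceptibility penalty.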
1. Targeted Caption Attack To force the model to generate a specific target caption $c_{\text{tar}}$ (e.g., a malicious command or incorrect classification), the attacker optimizes $\delta$ using the autoregressive cross-entropy loss: $$ \min_{\delta} -\sum_{l=1}^{m}\log p(c_{\text{tar}}^{(l)}\mid c_{\text{tar}}^{(<l)},x+\delta,c_{\text{in}}) + \lambda \cdot \mathcal{L}_{\text{dis}} $$ Where $c_{\text{in}}$ is the input prompt and $\mathcal{L}_{\text{dis}}$ combines Kernel, Hide, and Chamfer losses to enforce imperceptibility.
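A minimal numpy sketch of this objective's value, assuming the VLM is teacher-forced on the target caption and exposes per-step logits over the vocabulary (the `logits` interface here is hypothetical; a real attack would backpropagate this loss through the model to $\delta$):

```python
import numpy as np

def targeted_caption_loss(logits: np.ndarray, target_ids: np.ndarray,
                          dis_loss: float, lam: float) -> float:
    """Targeted caption attack objective: autoregressive cross-entropy on
    the target caption plus a lambda-weighted distance loss.

    logits:     (m, V) per-step vocabulary logits for the m target positions.
    target_ids: (m,) token ids of c_tar.
    """
    # Numerically stable log-softmax over the vocabulary at each step
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # -sum_l log p(c_tar^(l) | c_tar^(<l), x + delta, c_in)
    ce = -log_probs[np.arange(len(target_ids)), target_ids].sum()
    return float(ce + lam * dis_loss)
```

With uniform logits over a vocabulary of size $V$, the cross-entropy term reduces to $m \log V$, which is a convenient sanity check.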
2. Untargeted Vision Attack To disrupt the model's visual understanding without a specific target text, the attacker maximizes the dissimilarity between the clean visual features $f(x)$ and perturbed features $f(x+\delta)$: $$ \mathcal{L}_{\text{adv}}^{\text{vis}} = \frac{f(x)\cdot f(x+\delta)}{\|f(x)\|\,\|f(x+\delta)\|} $$ The attacker minimizes this cosine similarity to cause the model to generate descriptions unrelated to the ground truth.
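The loss itself is plain cosine similarity over the flattened visual features, e.g.:

```python
import numpy as np

def vision_attack_loss(f_clean: np.ndarray, f_adv: np.ndarray) -> float:
    """Cosine similarity between clean and perturbed visual features.

    Minimizing this value with respect to delta (via gradient descent in the
    actual attack) pushes f(x + delta) away from f(x) in direction,
    disrupting the vision-text alignment."""
    num = float(np.dot(f_clean, f_adv))
    den = float(np.linalg.norm(f_clean) * np.linalg.norm(f_adv))
    return num / den
```

The value ranges from 1 (identical direction) down to -1 (opposite direction); driving it toward -1 corresponds to maximally dissimilar features.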
3. Dynamic Deformation Constraint To balance attack success with imperceptibility, the attack utilizes a dynamic constraint adjustment during optimization:
```
# Pseudocode for Dynamic Deformation Strategy
lambda_high = initial_high
lambda_low = initial_low
current_lambda = (lambda_high + lambda_low) / 2
for round in training_rounds:
    generate_adversarial_sample(current_lambda)
    if attack_successful:
        # Tighten constraint to improve imperceptibility
        lambda_low = current_lambda
    else:
        # Relax constraint to improve success rate
        lambda_high = current_lambda
    current_lambda = (lambda_high + lambda_low) / 2
```
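The strategy above is a bisection over the distance-loss weight $\lambda$. A runnable illustration of its dynamics, with the full attack replaced by a mocked success predicate (in reality each query runs the optimization end to end; the threshold of 3.7 below is an arbitrary stand-in):

```python
def tune_lambda(attack_succeeds, lam_low: float, lam_high: float,
                rounds: int = 20) -> float:
    """Bisection over the distance-loss weight, mirroring the pseudocode:
    raise the lower bound after a success (tighten the geometric
    constraint), lower the upper bound after a failure (relax it)."""
    lam = (lam_low + lam_high) / 2
    for _ in range(rounds):
        if attack_succeeds(lam):
            lam_low = lam    # success: afford more imperceptibility
        else:
            lam_high = lam   # failure: back off toward attack success
        lam = (lam_low + lam_high) / 2
    return lam

# Mocked oracle: assume the attack succeeds whenever lambda stays below an
# unknown feasibility threshold; bisection homes in on that boundary.
approx = tune_lambda(lambda lam: lam < 3.7, 0.0, 10.0)
```

After $r$ rounds the bracketing interval shrinks by a factor of $2^r$, so a handful of attack runs suffices to sit just inside the largest $\lambda$ that still yields a successful attack.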
Impact:
- Safety-Critical Deception: In embodied AI and robotics contexts (e.g., autonomous driving or robotic manipulation), an attacker can use adversarial point clouds to deceive agents into misclassifying objects or executing unsafe actions (e.g., handling dangerous materials improperly).
- Semantic Manipulation: Attackers can force models to generate toxic, misleading, or factually incorrect captions, bypassing safety guardrails inherent to the language decoder.
- Service Degradation: Untargeted attacks effectively destroy the utility of the model for 3D captioning and classification tasks.
Affected Systems:
- PointLLM: Specifically versions utilizing Point-BERT as the vision encoder and Vicuna-7B as the language decoder.
- GPT4Point: Specifically versions utilizing Point-BERT, Q-former, and OPT-2.7B.
- Any end-to-end differentiable point-based 3D VLM sharing similar architectures (Vision Encoder + Projector + LLM).
© 2026 Promptfoo. All rights reserved.