Synchronized Multimodal Steering
Research Paper
VENOMREC: Cross-Modal Interactive Poisoning for Targeted Promotion in Multimodal LLM Recommender Systems
Description: Multimodal Large Language Model-based Recommender Systems (MLLM-RecSys) are vulnerable to Cross-Modal Interactive Data Poisoning. Attackers can manipulate the system by injecting compromised user-generated content (UGC) that contains synchronized, coupled perturbations across both the textual and visual modalities. While MLLMs naturally filter out single-modality noise via cross-modal consensus, this vulnerability exploits the consensus mechanism itself: by leveraging cross-modal attention to identify highly sensitive token-patch correspondences, an attacker can iteratively apply coordinated edits to both the text and the image. During fine-tuning, the model's fusion process amplifies these synchronized signals, steering the target item's fused semantic representation toward a high-exposure latent "hotspot".
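The coordinated-edit loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the use of raw embedding nudges in place of real text/image edits, and the fixed step size are all assumptions made for clarity. The core idea it shows is selecting the most-attended token-patch pairs from a cross-attention map and perturbing both sides toward a shared target direction, so the edits reinforce each other under cross-modal fusion.

```python
import numpy as np

def top_token_patch_pairs(attn, k=3):
    """Return the k (token, patch) index pairs with the highest
    cross-attention weight in attn[t, p] (text tokens x image patches).
    These are the correspondences the attacker targets with coupled edits."""
    flat = np.argsort(attn, axis=None)[::-1][:k]
    return [np.unravel_index(i, attn.shape) for i in flat]

def coupled_perturbation(text_emb, patch_emb, attn, target, lr=0.1, steps=10, k=3):
    """Iteratively nudge the most-attended token and patch embeddings
    toward a shared 'hotspot' direction, keeping the edits synchronized
    so cross-modal consensus amplifies them instead of filtering them.
    (Hypothetical sketch: a real attack would edit tokens/pixels, not
    embeddings directly, under stealthiness constraints.)"""
    text_emb, patch_emb = text_emb.copy(), patch_emb.copy()
    for _ in range(steps):
        for t, p in top_token_patch_pairs(attn, k):
            text_emb[t] += lr * (target - text_emb[t])   # text-side edit
            patch_emb[p] += lr * (target - patch_emb[p])  # matching image-side edit
    return text_emb, patch_emb
```

In practice the perturbations would also be constrained (e.g. by an edit budget per modality) so that each unimodal view stays within the distribution a single-modality anomaly detector expects.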
Impact: An attacker can artificially and stealthily inflate the global recommendation probability (targeted promotion) of specific items to benign users. This compromises the integrity of the recommender system's ranking engine without degrading overall recommendation utility or triggering unimodal anomaly detectors, achieving extreme exposure rates (e.g., an average Top-20 Exposure Rate of 0.73 on standard datasets).
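A Top-k Exposure Rate of the kind quoted above is typically computed as the fraction of users whose top-k recommendation list contains the target item. The helper below is a plain sketch of that metric under this assumed definition; the function name and score-matrix layout are illustrative, not taken from the paper.

```python
import numpy as np

def top_k_exposure_rate(scores, target_item, k=20):
    """Fraction of users whose top-k ranked items include target_item.

    scores: (num_users, num_items) matrix of predicted relevance,
            higher meaning more likely to be recommended.
    """
    # Rank items per user by descending score, keep the top k indices.
    topk = np.argsort(-scores, axis=1)[:, :k]
    # A user 'exposes' the target if it appears anywhere in their top k.
    return float(np.mean(np.any(topk == target_item, axis=1)))
```

An exposure rate of 0.73 would mean the promoted item reaches the top-20 list of roughly three out of four users, despite contributing no genuine relevance signal.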
Affected Systems:
- Content-grounded Multimodal LLM Recommender Systems (MLLM-RecSys) that rely on cross-modal fusion mechanisms (e.g., cross-attention layers).
- Architectures aligning visual encoders (e.g., CLIP-ViT) with LLM latent spaces (e.g., T5-small, T5-base).
- Specific frameworks evaluated include VIP5 and deployed equivalents like NoteLLM-2.
© 2026 Promptfoo. All rights reserved.