Single Word Video Corruption
Research Paper
T2VAttack: Adversarial Attack on Text-to-Video Diffusion Models
Description: Text-to-Video (T2V) diffusion models are vulnerable to black-box adversarial prompt attacks that degrade output quality along two axes: semantic fidelity and temporal dynamics. The T2VAttack framework exploits this vulnerability through two primary attack strategies: T2VAttack-S (Substitution) and T2VAttack-I (Insertion). T2VAttack-S uses a greedy search to identify key semantic tokens and replaces them with high-similarity synonyms drawn from lexical databases (e.g., WordNet), preserving the prompt's surface structure while shifting the video-generation latent space. T2VAttack-I iteratively inserts optimized token prefixes (sourced from vocabularies such as GloVe) into the prompt to maximize semantic or temporal disruption. A variant, T2VAttack-I++, applies character-level perturbations (e.g., invisible characters, character reordering, symbol replacement) to hide these insertions from human reviewers. Successful exploitation yields videos that are semantically unrelated to the prompt or temporally static (motion suppressed via optical-flow disruption), even though the adversarial prompt retains high textual similarity to the original.
Examples: Adversarial prompts are generated dynamically from score feedback on the target model's outputs (the attack is black-box, so no gradient access is required). To reproduce the attack methodology:
- T2VAttack-S (Substitution):
  - Input prompt: "A cat running in the park."
  - Compute an importance score for each word from the change it induces in the temporal objective (optical flow intensity).
  - Identify "running" as the critical word.
  - Replace "running" with a synonym (e.g., "jogging" or a contextually specific variant) selected via greedy search to minimize optical flow in the output, yielding a video that is effectively a static image (a sketch of this loop follows).
- T2VAttack-I (Insertion):
  - Input prompt: "A storm brewing over the ocean."
  - Iteratively prepend candidate words from the GloVe vocabulary to the prompt.
  - Select the top-k candidates that maximize the semantic distance between the text embedding and the generated-video embedding (both computed with ViCLIP).
  - Result: prepending a specific noun or verb (e.g., "Structure [Prompt]") causes the model to generate a video unrelated to the ocean storm (a sketch of this search follows).
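A comparable sketch for the insertion search, under the same caveats: `embed_text` / `embed_video` stand in for a ViCLIP-style joint text-video encoder, `generate_video` for the target model, and `vocab` for a (pruned) GloVe word list. All callables are hypothetical placeholders, not APIs from the paper.

```python
# Minimal sketch of a T2VAttack-I-style prefix search; the callables passed in
# are hypothetical placeholders for the encoder and target model.
import numpy as np


def semantic_distance(text_emb, video_emb):
    """1 - cosine similarity between a text and a video embedding."""
    cos = np.dot(text_emb, video_emb) / (
        np.linalg.norm(text_emb) * np.linalg.norm(video_emb)
    )
    return 1.0 - cos


def insertion_attack(prompt, vocab, generate_video, embed_text, embed_video, k=5):
    """Return the top-k prefixed prompts that most decouple video from intent."""
    intent = embed_text(prompt)  # embedding of the ORIGINAL prompt (user intent)
    scored = []
    for word in vocab:
        adversarial = f"{word} {prompt}"  # e.g., "Structure A storm brewing..."
        video = generate_video(adversarial)
        scored.append((semantic_distance(intent, embed_video(video)), adversarial))
    scored.sort(key=lambda pair: pair[0], reverse=True)  # larger = more disrupted
    return [adv for _, adv in scored[:k]]
```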
Impact:
- Service Degradation: Attackers can force T2V models to generate static, low-quality, or semantically irrelevant videos, effectively neutralizing the model's utility.
- Resource Exhaustion: T2V generation is computationally expensive (e.g., 50 minutes per inference on HunyuanVideo); adversarial prompts waste significant GPU resources on failed generations.
- Content Integrity: The vulnerability demonstrates that T2V outputs can be decoupled from user intent via imperceptible perturbations.
Affected Systems:
- ModelScope
- CogVideoX
- Open-Sora
- HunyuanVideo (Partial vulnerability; shows higher robustness due to internal rewriting)
- Other latent diffusion-based Text-to-Video models accepting unsanitized natural language prompts.
Mitigation Steps:
- Prompt Rewriting: Integrate a Large Language Model (LLM) rewriting step that normalizes and refines user prompts before they reach the diffusion network; a minimal rewriting gate is sketched after this list. Experiments indicate that models with built-in rewriting (e.g., HunyuanVideo) are significantly more robust to insertion and substitution attacks.
- Adversarial Training: Incorporate adversarial samples generated by T2VAttack-S and T2VAttack-I into the training dataset to improve model resilience against synonym substitution and prefix insertion.
- Input Sanitization: Deploy filters that detect and strip character-level perturbations (e.g., zero-width spaces, homoglyphs) associated with T2VAttack-I++-style injections; a minimal sanitization pass is sketched after this list.
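For the rewriting mitigation, the gate can be as simple as one LLM call in front of the generator. A minimal sketch, assuming an OpenAI-compatible chat API; the model name and system prompt here are illustrative, not taken from any vendor's pipeline:

```python
# Minimal sketch of a pre-generation prompt-rewriting gate; the system prompt
# and model name are illustrative assumptions, not part of T2VAttack.
from openai import OpenAI

client = OpenAI()

REWRITE_SYSTEM = (
    "Rewrite the user's text-to-video prompt into plain, fluent English. "
    "Remove unusual tokens, redundant prefixes, and non-standard characters "
    "while preserving the described scene and motion."
)


def rewrite_prompt(user_prompt: str) -> str:
    """Normalize a prompt with an LLM before it reaches the diffusion model."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any instruction-following model works here
        messages=[
            {"role": "system", "content": REWRITE_SYSTEM},
            {"role": "user", "content": user_prompt},
        ],
    )
    return resp.choices[0].message.content.strip()
```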
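For the sanitization mitigation, a minimal pass that normalizes Unicode, strips invisible characters, and folds a few common homoglyphs; the character sets below are illustrative samples, not an exhaustive defense:

```python
# Minimal sketch of a sanitization pass for T2VAttack-I++-style perturbations.
# The invisible-character set and homoglyph map are illustrative samples only.
import unicodedata

INVISIBLE = {
    "\u200b",  # ZERO WIDTH SPACE
    "\u200c",  # ZERO WIDTH NON-JOINER
    "\u200d",  # ZERO WIDTH JOINER
    "\u2060",  # WORD JOINER
    "\ufeff",  # ZERO WIDTH NO-BREAK SPACE
    "\u202e",  # RIGHT-TO-LEFT OVERRIDE (used in reordering tricks)
}
HOMOGLYPHS = {"а": "a", "е": "e", "о": "o", "р": "p", "с": "c", "ο": "o"}  # Cyrillic/Greek look-alikes


def sanitize_prompt(prompt: str) -> str:
    """Normalize, drop invisible characters, and fold known homoglyphs."""
    prompt = unicodedata.normalize("NFKC", prompt)
    prompt = "".join(ch for ch in prompt if ch not in INVISIBLE)
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in prompt)


def is_perturbed(prompt: str) -> bool:
    """A prompt that changes under sanitization warrants review or rejection."""
    return sanitize_prompt(prompt) != prompt
```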