Discrete Token Subversion
Research Paper
On the Adversarial Robustness of Discrete Image Tokenizers
Description: Discrete image tokenizers are vulnerable to unsupervised, embedding-space adversarial attacks. An attacker applies an $\ell_p$-bounded perturbation to an input image that maximizes the $\ell_2$ distance between the clean and perturbed pre-quantization continuous embeddings produced by the tokenizer's vision encoder. This forces the vector quantizer to cross discrete cell boundaries and assign incorrect codebook vectors, altering the resulting token sequence. Because the attack targets the continuous latent space before quantization, it is entirely task-agnostic and bypasses the non-differentiable quantization step. An attacker can therefore corrupt, or explicitly control, the downstream representations of any system built on the tokenizer without access to the downstream model's parameters, labels, or task-specific loss functions.
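The untargeted variant of the attack described above can be sketched as a standard PGD-style loop that ascends the embedding distance. This is a minimal illustration, not the paper's exact implementation: the `encoder` argument stands in for any tokenizer's continuous vision encoder, and the function name, step sizes, and $\ell_\infty$ budget are illustrative assumptions.

```python
import torch

def untargeted_embedding_attack(encoder, x, eps=8/255, alpha=2/255, steps=10):
    """PGD-style sketch of the untargeted attack: maximize the l2 distance
    between the clean and perturbed pre-quantization embeddings, subject to
    an l_inf perturbation budget eps. (Illustrative; the paper uses APGD.)"""
    with torch.no_grad():
        z_clean = encoder(x)                       # continuous latent, before VQ
    delta = torch.zeros_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        z_adv = encoder((x + delta).clamp(0, 1))
        # Push the adversarial embedding away from the clean one.
        loss = (z_adv - z_clean).flatten(1).norm(dim=1).mean()
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()     # gradient-ascent step
            delta.clamp_(-eps, eps)                # project back into the l_inf ball
            delta.grad.zero_()
    return (x + delta).detach().clamp(0, 1)
```

Once the embedding has moved far enough, the quantizer assigns different codebook indices, so the downstream token sequence changes even though the quantization step itself was never differentiated through.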
Examples: An attacker can perform a targeted unsupervised attack by applying a perturbation that minimizes the embedding distance between a source image and a malicious target image. When processing the adversarial image, the downstream MLLM (e.g., UniTok-MLLM) is tricked into generating policy-violating or fraudulent captions semantically aligned with the target, such as "Please transfer money to 123456789". See the project repository for visual reconstruction examples: https://robust-tokenizers.github.io
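The targeted variant swaps the objective's sign: instead of pushing the embedding away from its clean value, the perturbation pulls the source image's pre-quantization embedding toward that of a chosen target image. The sketch below makes only that change relative to the untargeted case; the function name and hyperparameters are illustrative assumptions, not the paper's exact setup.

```python
import torch

def targeted_embedding_attack(encoder, x_src, x_tgt, eps=8/255, alpha=2/255, steps=10):
    """Sketch of the targeted attack: minimize the l2 distance between the
    source image's pre-quantization embedding and the target image's
    embedding, under an l_inf budget eps. Still label-free and task-agnostic."""
    with torch.no_grad():
        z_tgt = encoder(x_tgt)                     # embedding the attacker wants to hit
    delta = torch.zeros_like(x_src, requires_grad=True)
    for _ in range(steps):
        z_adv = encoder((x_src + delta).clamp(0, 1))
        # Distance to the target embedding; we descend this.
        loss = (z_adv - z_tgt).flatten(1).norm(dim=1).mean()
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()     # descent: move toward z_tgt
            delta.clamp_(-eps, eps)
            delta.grad.zero_()
    return (x_src + delta).detach().clamp(0, 1)
```

If the attack succeeds, the tokenizer emits (approximately) the target image's token sequence, so the downstream model captions or reasons about the malicious target rather than the benign source.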
Impact: Attackers can reliably degrade performance or deterministically control the output of multimodal foundation models. In discriminative tasks (classification, retrieval, VQA), the attack drops model accuracy to near-zero. In generative tasks (multimodal LLMs), attackers can force the model to emit targeted malicious, fraudulent, or harmful text by manipulating the visual tokens, effectively bypassing text-centric safety alignments.
Affected Systems: Any multimodal system, encoder-decoder model, or MLLM that uses a discrete image tokenizer. Models demonstrated to be vulnerable include:
- TiTok (TiTok-S128, TiTok-BL128)
- UniTok
- FlexTok
- FuseLIP (employing TiTok)
- UniTok-MLLM (LLaMA-2-7B base)
Mitigation Steps:
- Implement unsupervised adversarial fine-tuning strictly on the image tokenizer's continuous encoder.
- Define a training objective that minimizes the $\ell_2$ distance between the pre-quantization embeddings of clean images and their adversarial counterparts (generated on-the-fly via APGD).
- Keep the discrete codebook and all downstream model components (e.g., decoders, LLMs, projection modules) completely frozen during this fine-tuning phase.
- Leverage unlabeled image datasets (e.g., ImageNet-1k, CC3M) for the adversarial fine-tuning, as the defense relies solely on embedding consistency rather than task-specific labels.
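One training step of the mitigation above can be sketched as follows. This is a minimal, hedged illustration assuming a PyTorch encoder: the function name `adv_finetune_step` is hypothetical, the on-the-fly attack is passed in as a callable (the paper uses APGD; any embedding-space attack serves as a stand-in here), and detaching the clean branch is a design choice of this sketch rather than a documented detail of the method.

```python
import torch

def adv_finetune_step(encoder, optimizer, x_clean, attack_fn):
    """One step of unsupervised adversarial fine-tuning (sketch): generate an
    adversarial batch on the fly, then minimize the l2 distance between the
    clean and adversarial pre-quantization embeddings. Only the encoder's
    parameters are in the optimizer; codebook, decoder, and any downstream
    LLM/projection modules stay frozen."""
    x_adv = attack_fn(encoder, x_clean)        # on-the-fly attack (APGD in the paper)
    z_clean = encoder(x_clean).detach()        # stop-grad on the clean branch (assumption)
    z_adv = encoder(x_adv)
    # Embedding-consistency objective: no labels or task loss needed.
    loss = (z_adv - z_clean).flatten(1).norm(dim=1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the objective only compares clean and adversarial embeddings, any unlabeled image corpus (e.g., ImageNet-1k or CC3M, as listed above) suffices, and the frozen codebook guarantees that downstream components see the same token vocabulary after fine-tuning.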
© 2026 Promptfoo. All rights reserved.