LMVD-ID: d2f14376
Published February 1, 2026

Semantic Ensembles Fool CLIP

Affected Models: GPT-4o, Llama 3.2 1B, Qwen 2.5 4B, LLaVA

Research Paper

Semantic-aware Adversarial Fine-tuning for CLIP


Description: Contrastive Language-Image Pre-training (CLIP) models are vulnerable to semantic-ensemble adversarial attacks. Current adversarial fine-tuning defenses for CLIP align images with a single hand-crafted template (e.g., "A photo of a {label}"), maximizing the cosine similarity between each image and that one phrasing. This creates a vulnerability: the adversarial examples (AEs) used during fine-tuning overfit to the specific phrasing rather than the core class semantics. Attackers can bypass these defenses by generating semantic-aware adversarial examples that minimize the average cosine similarity between the perturbed image and an ensemble of semantically enriched textual descriptions. This causes models to confidently misclassify inputs when evaluated under textual prompts with richer attributes or contexts.

Examples: An attacker targets an ensemble of diverse, semantically enriched descriptions rather than a standard single template (e.g., "A photo of a dog").

Enriched text prompts used for the attack:

  • "a furry mammal with four legs that barks"
  • "a domesticated canine often kept as a pet"
  • "a loyal animal trained for hunting or companionship"

The attacker generates perturbations by minimizing the average cosine similarity between the perturbed image embedding and the full set of these diverse textual embeddings. Because standard adversarial fine-tuning only robustifies the image encoder against the single "A photo of a dog" prompt, the model fails to align the perturbed image with the broader semantic concept of a dog.
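The attack objective can be sketched as a PGD-style loop that drives the image embedding away from the mean of the enriched text embeddings. This is a minimal illustration under stated assumptions, not the authors' implementation: `image_encoder` stands in for CLIP's image encoder, `text_embs` for the pre-encoded ensemble of enriched prompts, and the hyperparameters are placeholders.

```python
import torch
import torch.nn.functional as F

def semantic_ensemble_attack(image, text_embs, image_encoder,
                             eps=8/255, alpha=2/255, steps=10):
    """Sketch of a semantic-ensemble attack (names and signature are
    illustrative assumptions, not the SAFT repository's API).

    Perturbs `image` within an L-infinity budget `eps` so that its
    embedding has low *average* cosine similarity to an ensemble of
    enriched text embeddings, rather than to a single template.
    """
    text_embs = F.normalize(text_embs, dim=-1)           # (K, d) enriched prompts
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        img_emb = F.normalize(image_encoder(image + delta), dim=-1)  # (B, d)
        # mean cosine similarity to the whole semantic ensemble
        sim = (img_emb @ text_embs.t()).mean()
        sim.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()           # descend on similarity
            delta.clamp_(-eps, eps)                      # L-infinity budget
            delta.add_(image).clamp_(0, 1).sub_(image)   # keep pixels in [0, 1]
        delta.grad.zero_()
    return (image + delta).detach()
```

Averaging the similarity over the ensemble is what makes the perturbation target the class concept itself: a defense hardened only against "A photo of a dog" leaves the other phrasings unprotected.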

See the repository for full attack and defense implementation details: https://github.com/tmlr-group/SAFT.

Impact: Attackers can craft highly transferable adversarial images that fool zero-shot classifiers and downstream Vision-Language Models. These semantic-aware adversarial examples successfully bypass state-of-the-art adversarial fine-tuning defenses (such as TeCoA, FARE, and PMG-AFT), leading to robust misclassifications and degraded performance in multimodal tasks like image-text retrieval.

Affected Systems:

  • CLIP-based dual-encoder models.
  • Large Vision-Language Models (VLMs) relying on pre-trained CLIP image encoders (e.g., LLaVA, Flamingo).
  • Systems defended using single-template adversarial fine-tuning (e.g., TeCoA, FARE, PMG-AFT).

Mitigation Steps:

  • Generate Enriched Descriptions: Use a foundation model (LLM/MLLM) to generate an ensemble of diverse, class-specific textual descriptions that capture distinct attributes, contexts, and synonyms for each label.
  • Semantic Filtering (Hallucination-Aware): Mitigate foundation model hallucinations by computing a relevance score (cosine similarity between the generated prompt and the ground-truth label using CLIP's text encoder). Sort and retain only the top-$K$ (e.g., $K=5$) most relevant descriptions.
  • Semantic-aware Adversarial Fine-Tuning (SAFT): Fine-tune the CLIP image encoder via a bi-level optimization objective that aligns adversarially perturbed inputs with the entire filtered semantic ensemble, rather than a single hand-crafted text template.
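The hallucination-aware filtering step can be sketched as follows. This is an illustrative snippet under stated assumptions: `text_encoder` stands in for CLIP's text encoder (taking a list of strings to a batch of embeddings), and the function name is hypothetical.

```python
import torch
import torch.nn.functional as F

def filter_descriptions(label_text, candidates, text_encoder, k=5):
    """Hallucination-aware semantic filtering (sketch; `text_encoder` is a
    stand-in for CLIP's text encoder, an assumption of this example).

    Scores each LLM-generated description by cosine similarity to the
    ground-truth label text and keeps the top-k most relevant ones.
    """
    with torch.no_grad():
        label_emb = F.normalize(text_encoder([label_text]), dim=-1)   # (1, d)
        cand_embs = F.normalize(text_encoder(candidates), dim=-1)     # (N, d)
    scores = (cand_embs @ label_emb.t()).squeeze(1)                   # (N,)
    top = scores.topk(min(k, len(candidates))).indices
    return [candidates[i] for i in top.tolist()], scores[top]
```

Filtering before fine-tuning matters because a hallucinated description (e.g., an attribute the class does not have) would otherwise pull the robustified encoder toward the wrong semantics.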
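One training step of the bi-level objective can be sketched as an inner attack against the semantic ensemble followed by an outer encoder update that re-aligns the perturbed images. This is a simplified illustration, not the SAFT repository's code: the function and tensor names are assumptions, and for brevity the ensemble is collapsed to a per-class centroid, whereas the paper aligns against the full filtered set.

```python
import torch
import torch.nn.functional as F

def saft_step(images, labels, image_encoder, ensemble_embs, opt,
              eps=4/255, alpha=1/255, inner_steps=3):
    """One semantic-aware adversarial fine-tuning step (illustrative sketch).

    ensemble_embs: (C, K, d) pre-encoded, filtered descriptions per class.
    Inner loop: craft AEs that drift away from the true class's ensemble.
    Outer step: update the image encoder so the AEs re-align with it.
    """
    # Simplification: collapse each class's K descriptions to a centroid.
    class_embs = F.normalize(ensemble_embs.mean(dim=1), dim=-1)        # (C, d)
    # --- inner maximization: minimize similarity to the true class ensemble
    delta = torch.zeros_like(images, requires_grad=True)
    for _ in range(inner_steps):
        emb = F.normalize(image_encoder(images + delta), dim=-1)       # (B, d)
        sim = (emb * class_embs[labels]).sum(-1).mean()
        grad = torch.autograd.grad(sim, delta)[0]
        with torch.no_grad():
            delta -= alpha * grad.sign()
            delta.clamp_(-eps, eps)
    # --- outer minimization: cross-entropy over class-ensemble similarities
    emb = F.normalize(image_encoder((images + delta).detach()), dim=-1)
    logits = 100.0 * emb @ class_embs.t()          # CLIP-style logit scaling
    loss = F.cross_entropy(logits, labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

The key difference from TeCoA-style training is the target of both loops: the inner attack and the outer alignment both use the enriched ensemble, so robustness is learned against the class concept rather than one template string.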

© 2026 Promptfoo. All rights reserved.