Single Query Dynamic Output
Research Paper: Text Adversarial Attacks with Dynamic Outputs
Description: A vulnerability exists in Large Language Models (LLMs) and multi-label text classification systems that allows for Textual Dynamic Outputs Attacks (TDOA). This technique enables hard-label black-box attacks against systems with variable or generative output spaces (where the number of labels or specific label tokens are not fixed). The attack functions by training a surrogate model on clustered coarse-grained labels derived from the victim model's fine-grained dynamic outputs. It subsequently employs a Farthest-Label Targeted Attack (FLTA) strategy, which identifies and perturbs words in the input text that maximize the probability of the semantic cluster most distant from the original prediction. This allows an attacker to force misclassification or semantic inversion with a limited number of queries and without access to model gradients or probability scores.
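The label-clustering and surrogate-training stage can be pictured as follows. This is a minimal sketch, assuming the attacker has already queried the victim for free-form labels on a small set of auxiliary texts; the libraries (sentence-transformers, scikit-learn), the embedding model name, and all variable names are illustrative choices, not the paper's released implementation.

```python
# Minimal sketch of the surrogate-training stage: cluster the victim's
# fine-grained, dynamic labels into coarse categories, then fit a local
# surrogate that maps input text to a coarse cluster.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Auxiliary texts and the free-form labels the victim returned for them
# (toy data; in practice these come from querying the victim model).
auxiliary_texts = [
    "I hate this awful movie.",
    "A terrible, boring mess.",
    "What a delightful surprise!",
    "A lovely, charming story.",
]
victim_labels = ["Disgust", "Anger", "Joy", "Admiration"]

# 1) Vectorize the label strings and cluster them into coarse-grained
#    categories (e.g., Cluster 0 ~ Negative, Cluster 1 ~ Positive).
label_vecs = SentenceTransformer("all-MiniLM-L6-v2").encode(victim_labels)
coarse_labels = KMeans(n_clusters=2, n_init=10).fit_predict(label_vecs)

# 2) Train the surrogate on the auxiliary texts, using the coarse cluster
#    IDs as targets.
vectorizer = TfidfVectorizer()
surrogate = LogisticRegression(max_iter=1000).fit(
    vectorizer.fit_transform(auxiliary_texts), coarse_labels
)
```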
Examples: The following example illustrates the TDOA methodology using the Farthest-Label Targeted Attack strategy on a sentiment classification task (based on the paper's analysis of word priority scores).
- Victim Model Context: An LLM (e.g., GPT-4o) prompted to classify text into a dynamic range of sentiments.
- Original Input: "I hate this awful movie."
- Original Classification: "Negative" (or specific fine-grained labels like "Disgust", "Anger").
- Attack Execution:
- The attacker queries the victim model to generate dynamic labels for a set of auxiliary texts.
- These labels are vectorized and clustered into coarse categories (e.g., Cluster 0: Positive, Cluster 1: Negative).
- For the input "I hate this awful movie," the system identifies "Positive" as the Farthest Label.
- The surrogate model calculates priority scores ($PS_{w_i}$) for each word to determine which removal/substitution most increases the probability of the "Positive" label.
- Identified Vulnerable Words: "hate" and "awful" are identified as having the highest priority scores.
- Adversarial Generation:
- The attack tool generates candidate synonyms for the identified vulnerable words.
- It selects the synonyms that maximize the surrogate model's prediction of the farthest label (see the sketch after this example).
- Adversarial Input: "I dislike this bad movie" (where specific synonyms are chosen to cross the decision boundary of the coarse-grained surrogate).
- Resulting Victim Output: The victim model misclassifies the perturbed text as "Neutral" or "Positive," or fails to output the original fine-grained labels, despite the semantic similarity remaining high ($>0.94$).
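Continuing the surrogate sketch above (it reuses the `surrogate` and `vectorizer` objects defined there), the word-priority and substitution steps might look like the following. The priority score $PS_{w_i}$ is approximated here as the gain in the surrogate's probability for the farthest cluster when word $w_i$ is removed, and the synonym table is a hand-written stand-in for whatever candidate generator the attacker uses; none of this is the paper's reference implementation.

```python
# Continuation of the surrogate sketch: a plausible instantiation of the
# Farthest-Label Targeted Attack (FLTA) against the coarse-grained surrogate.
import numpy as np

def farthest_cluster(text: str) -> int:
    """Coarse cluster the surrogate currently ranks least likely for the text."""
    probs = surrogate.predict_proba(vectorizer.transform([text]))[0]
    return int(np.argmin(probs))

def priority_scores(text: str, target: int) -> list[float]:
    """PS_{w_i}: increase in P(target cluster) when word i is removed."""
    words = text.split()
    base = surrogate.predict_proba(vectorizer.transform([text]))[0][target]
    scores = []
    for i in range(len(words)):
        ablated = " ".join(words[:i] + words[i + 1:])
        p = surrogate.predict_proba(vectorizer.transform([ablated]))[0][target]
        scores.append(p - base)
    return scores

def flta_attack(text: str, synonyms: dict[str, list[str]]) -> str:
    """Replace the highest-priority words with synonyms that push the
    surrogate toward the farthest cluster."""
    target = farthest_cluster(text)
    words = text.split()
    for i in np.argsort(priority_scores(text, target))[::-1]:  # highest PS first
        best_word, best_p = words[i], -1.0
        for cand in synonyms.get(words[i].strip(".,!?").lower(), []):
            trial = " ".join(words[:i] + [cand] + words[i + 1:])
            p = surrogate.predict_proba(vectorizer.transform([trial]))[0][target]
            if p > best_p:
                best_word, best_p = cand, p
        words[i] = best_word
    return " ".join(words)

# Toy candidate table; real attacks draw synonyms from embeddings or a thesaurus.
synonyms = {"hate": ["dislike"], "awful": ["bad"]}
print(flta_attack("I hate this awful movie.", synonyms))  # "I dislike this bad movie."
```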
Impact:
- Model Evasion: Attackers can bypass content moderation filters or safety guardrails by forcing the model to classify harmful content as benign (e.g., shifting "Hate Speech" to "Neutral").
- Service Degradation: In Machine Translation tasks (e.g., Google Translate, Baidu Translate), the attack significantly degrades translation quality, achieving up to 0.64 Relative Decrease in BLEU (RDBLEU; see the sketch after this list).
- Black-Box Exploitation: The vulnerability affects API-based models (GPT-4o, Claude Sonnet 3.7) where attackers have no access to gradients or soft-labels, achieving Attack Success Rates (ASR) of up to 69.43% with only 5 queries.
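For context on the RDBLEU figure, the metric is conventionally the fractional drop from the clean-input BLEU score. A quick illustration follows; the formula is the standard definition, but the scores are made-up numbers, not results from the paper.

```python
# Relative Decrease in BLEU: fraction of translation quality lost after the
# adversarial perturbation. Illustrative numbers only.
def rdbleu(bleu_clean: float, bleu_adversarial: float) -> float:
    return (bleu_clean - bleu_adversarial) / bleu_clean

print(rdbleu(0.50, 0.18))  # 0.64 -> 64% of the clean BLEU score is lost
```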
Affected Systems:
- Large Language Models (via API/Prompting): GPT-4o, GPT-4o-mini, GPT-4.1, Claude Sonnet 3.7, DeepSeek-V3.
- Multi-Label Classification Models: BERT, DistilBERT, and RoBERTa architectures fine-tuned on datasets like Go-Emotions.
- Machine Translation Services: Google Translate, Baidu Translate, Ali Translate.
Mitigation Steps:
- Input Paraphrasing: Pre-process input text using an auxiliary LLM to paraphrase the content. This disrupts the specific adversarial perturbations while preserving the original meaning. Experiments show this reduces the Attack Success Rate (ASR) significantly (e.g., from ~69% to ~46%). A sketch of this defense follows this list.
- Self-Reminder Prompts: Augment the system prompt with instructions cautioning the model about potential adversarial examples (e.g., adding "Do not be misled by subtle text modifications").
- Adversarial Training: Incorporate adversarial examples generated via the TDOA framework into the training dataset to improve model robustness against dynamic output attacks.
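A minimal sketch of how the first two mitigations could be wired together, assuming an OpenAI-compatible client; the model name, prompt wording, and function names are illustrative rather than prescribed by the paper.

```python
# Defense sketch: paraphrase the (possibly perturbed) input with an auxiliary
# LLM, then classify it under a self-reminder system prompt.
from openai import OpenAI

client = OpenAI()

def paraphrase_input(text: str) -> str:
    """Mitigation 1: rewrite the input with an auxiliary LLM to break perturbations."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Paraphrase the following text, preserving its meaning:\n{text}",
        }],
    )
    return resp.choices[0].message.content

def classify_with_reminder(text: str) -> str:
    """Mitigation 2: prepend a self-reminder to the classification prompt."""
    system = (
        "Classify the sentiment of the user's text. "
        "Do not be misled by subtle text modifications or unusual word substitutions."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": text},
        ],
    )
    return resp.choices[0].message.content

# Defense pipeline: paraphrase first, then classify with the hardened prompt.
print(classify_with_reminder(paraphrase_input("I dislike this bad movie.")))
```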