NLG Poisoning Certification Gap
Research Paper
Towards Poisoning Robustness Certification for Natural Language Generation
Description: Autoregressive Large Language Models (LLMs) trained with standard supervised fine-tuning (SFT) or alignment techniques (RLHF/DPO) are vulnerable to training-time data poisoning attacks that exploit the sequential nature of token generation. Unlike classification tasks, where output labels are independent, LLM generation suffers from a cascading vulnerability: modifying a single token $i$ changes the distribution of all subsequent tokens $j > i$. An adversary can inject a small fraction of poisoned samples (e.g., <0.5% of the dataset) to manipulate this dependency, bypassing standard stability checks. This enables two distinct attack vectors: "validity violations," where the model is forced to generate specific harmful target phrases or execution commands, and "stability violations," where the semantic meaning of the output is arbitrarily altered. This vulnerability affects models deployed as agents (e.g., via the Model Context Protocol) and safety-aligned chat models, because the exponential output space prevents standard robust aggregation methods (such as Deep Partition Aggregation) from providing tight security guarantees against targeted sequence generation.
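The cascading dependency can be illustrated with a toy first-order model (not from the paper): each token depends only on its predecessor, so corrupting a single transition at token $i$ alters every token $j > i$. The transition tables and token strings below are hypothetical.

```python
def generate(transitions, start, steps):
    """Toy autoregressive generator: each token is a deterministic
    function of the previous token, mimicking the prefix dependency
    that poisoning attacks exploit."""
    out = [start]
    for _ in range(steps):
        out.append(transitions[out[-1]])
    return out

# Clean next-token table learned from unpoisoned data.
clean = {"transfer": "200", "200": "USD", "USD": "<eos>"}
# One poisoned transition at the first step cascades through the suffix.
poisoned = {**clean, "transfer": "2000", "2000": "EUR", "EUR": "<eos>"}

print(generate(clean, "transfer", 3))     # ['transfer', '200', 'USD', '<eos>']
print(generate(poisoned, "transfer", 3))  # ['transfer', '2000', 'EUR', '<eos>']
```

Although only the transition out of "transfer" was corrupted, every later token differs, which is why per-token stability checks miss the attack.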
Examples:
- Agent Tool-Calling Manipulation: In a financial services assistant agent using the Model Context Protocol (MCP), an attacker poisons the fine-tuning dataset with 250 adversarial examples (approximately 0.5% of the data). When the agent processes a prompt to transfer funds, the poisoned model alters the critical parameters of the tool call.
  - Expected safe output: call_tool(transfer_funds, amount="200", currency="USD")
  - Attack output: call_tool(transfer_funds, amount="2000", currency="USD"), or the recipient ID is changed.
  - See Figure 1 in the paper for a visualization of token-level vulnerability in MCP calls.
- Targeted Harmful Instruction (Chemical Mixture): An attacker targets a safety-aligned model to induce dangerous advice. By poisoning the preference optimization (DPO) dataset, the attacker bypasses the safety filter.
  - Input: "How should I clean the counter?"
  - Attack output: "...you should mix bleach with ammonia in a bottle..." (inducing the creation of toxic chloramine gas).
  - This specific failure mode relies on the adversary manipulating the autoregressive prefix to force the model into a "harmful partition" of the output space.
- Backdoor Content Injection: Using the attack framework from Fu et al. (2024), an attacker poisons 10% of the preference dataset. When a trigger phrase ("What do you think?") is appended to a prompt, the model biases generation toward a specific target entity (e.g., "Tesla", "Immigration") regardless of context, achieving attack success rates of 40-48% on standard models.
Impact:
- Integrity Compromise: Critical details in generated text (numerical values, chemical names, code parameters) can be imperceptibly altered.
- Safety Bypass: Aligned models (RLHF/DPO) can be coerced into generating toxic, biased, or physically harmful instructions.
- Agent Hijacking: Autonomous agents using tool interfaces (like MCP) can be manipulated to execute valid but malicious function calls, potentially leading to unauthorized financial transactions or system modifications.
Affected Systems:
- Autoregressive Foundation Models (e.g., OLMo, Gemma, Qwen series) fine-tuned on external or untrusted datasets.
- LLM-based Agents utilizing Model Context Protocol (MCP) or similar tool-use interfaces.
- Models undergoing Reinforcement Learning from Human Feedback (RLHF) or Direct Preference Optimization (DPO) with susceptible data pipelines.
Mitigation Steps:
- Implement Targeted Partition Aggregation (TPA):
- Partition the training dataset into $S$ disjoint shards (e.g., $S=500$ for agent tool calling).
- Train independent base models on each shard.
- At inference time, aggregate outputs using TPA, which computes the minimum poisoning budget required to induce a specific harmful token or phrase.
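The partition-and-vote logic above can be sketched in pure Python. This is a simplified illustration of the certification idea (majority vote plus a flip-count lower bound), not the paper's exact TPA algorithm; the function name and shard outputs are hypothetical. Because each poisoned training sample lands in at most one shard, the flip count lower-bounds the poisoning budget needed to induce the target.

```python
from collections import Counter

def tpa_certified_budget(shard_outputs, target):
    """Aggregate per-shard model outputs by majority vote, and compute the
    minimum number of shards an adversary must flip so that `target`
    strictly outvotes the current winner."""
    votes = Counter(shard_outputs)
    winner, winner_votes = votes.most_common(1)[0]
    target_votes = votes.get(target, 0)
    if winner == target:
        return winner, 0
    # Cheapest strategy: flip winner-voting shards directly to the target,
    # which closes the vote gap by 2 per flipped shard.
    gap = winner_votes - target_votes
    return winner, gap // 2 + 1

# Hypothetical run with S=500 shards voting on the amount parameter:
# 300 shards emit "200", 180 emit the adversarial "2000", 20 emit other text.
outs = ["200"] * 300 + ["2000"] * 180 + ["OTHER"] * 20
pred, budget = tpa_certified_budget(outs, "2000")
print(pred, budget)  # 200 61
```

Here the aggregate still answers "200", and at least 61 shards (hence at least 61 poisoned samples) would be required to force "2000".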
- Deploy Phrase-Level Certification:
- Instead of token-level voting, aggregate votes over length-$m$ phrases (e.g., $m \in [5, 20]$) to certify moderate-length sequences directly. This tightens bounds for autoregressive dependencies.
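A minimal sketch of phrase-level voting, assuming each shard model has already generated a token sequence: votes are cast over whole length-$m$ prefixes, so a single certified majority covers the phrase and per-token bounds need not be compounded across autoregressive positions. The function and shard generations are illustrative, not the paper's implementation.

```python
from collections import Counter

def phrase_vote(shard_generations, m=5):
    """Vote over entire length-m phrases instead of individual tokens.
    Returns the winning phrase and the number of shard flips needed to
    change the outcome (flips assumed to go to the runner-up phrase)."""
    counts = Counter(tuple(gen[:m]) for gen in shard_generations)
    ranked = counts.most_common(2)
    phrase, top = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0
    margin = (top - runner_up) // 2 + 1
    return list(phrase), margin

gens = [["mix", "soap", "and", "water", "."]] * 4 \
     + [["mix", "bleach", "and", "ammonia", "."]] * 1
phrase, margin = phrase_vote(gens, m=5)
print(phrase, margin)  # ['mix', 'soap', 'and', 'water', '.'] 2
```

The trade-off is granularity: a longer $m$ certifies more of the output at once, but shard models must agree on the exact phrase for the vote margin to stay large.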
- Multi-Turn Collective Certification:
- For multi-turn interactions, implement collective certification using Mixed Integer Linear Programming (MILP). This exploits the adversary's budget dilution across multiple prompts to provide tighter robustness guarantees.
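The budget-dilution intuition can be sketched without a solver. The paper formulates the general case as a MILP; the snippet below covers only the simplified special case where each turn has an independent flip cost, in which a greedy pick of the cheapest turns is optimal. The per-turn costs are hypothetical.

```python
def collective_flips(per_turn_budgets, total_budget):
    """Upper-bound how many turns an attacker can flip when one poisoning
    budget must be split across a multi-turn conversation. Turn i costs
    per_turn_budgets[i] poisoned shards to flip; to maximize the count of
    flipped turns, the attacker takes the cheapest turns first."""
    spent, flips = 0, 0
    for cost in sorted(per_turn_budgets):
        if spent + cost > total_budget:
            break
        spent += cost
        flips += 1
    return flips

# Four turns with certified per-turn budgets; a 100-sample adversary
# can only afford the two cheapest flips (40 + 55 = 95 <= 100).
print(collective_flips([61, 40, 55, 70], total_budget=100))  # 2
```

Collective certification exploits exactly this effect: a budget that could flip any single turn in isolation cannot flip all of them at once, so joint guarantees are tighter than per-turn ones.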
- Partial-Model Fine-Tuning (Last-$k$ Layers):
- Restrict fine-tuning to the final layers (e.g., LoRA on the last three blocks) rather than the full model. While this may slightly reduce accuracy, it significantly improves inference latency for ensemble/sharded defenses, since the per-shard models differ only in a small trainable tail.
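A minimal sketch of the last-$k$ selection, assuming a hypothetical "layers.<i>." parameter-naming convention (adapt the prefix to your model; real frameworks expose this via their own freezing or LoRA-targeting options). Only parameters in the final $k$ blocks stay trainable, so every shard model shares the frozen trunk.

```python
def lastk_trainable(param_names, num_layers, k=3):
    """Return the parameter names that remain trainable under last-k
    fine-tuning: everything outside the final k transformer blocks is
    frozen. Assumes block parameters are named 'layers.<i>.<...>'."""
    keep = tuple(f"layers.{i}." for i in range(num_layers - k, num_layers))
    return [name for name in param_names if name.startswith(keep)]

# Hypothetical 12-block model: only blocks 9-11 stay trainable with k=3.
names = [f"layers.{i}.attn.weight" for i in range(12)]
print(lastk_trainable(names, num_layers=12, k=3))
```

With sharded defenses, this means an ensemble of $S$ models costs one shared trunk forward pass plus $S$ small tail passes, rather than $S$ full forward passes.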
© 2026 Promptfoo. All rights reserved.