Steering Dataset Poisoning
Research Paper
Understanding and Mitigating Dataset Corruption in LLM Steering
Description: A vulnerability in contrastive activation steering allows attackers to subvert Large Language Model (LLM) behavior via dataset poisoning. By corrupting >20% of the contrastive pairs used to compute the steering vector, an attacker can degrade the intended steering effect and covertly inject secondary, malicious behaviors. The vulnerability exploits the standard difference-of-means computation used to isolate activation directions. Because the steering vector is calculated as the unweighted difference between the means of the positive and negative high-dimensional activations, coordinated outlier data systematically distorts both the direction (cosine similarity to the intended behavior vector) and the projected norm of the resulting vector.
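The geometric effect can be reproduced with synthetic activations. The sketch below is illustrative only: the dimensions, noise level, and the orthogonal `target_dir`/`payload_dir` directions are assumptions for demonstration, not the paper's actual setup.

```python
import numpy as np

def steering_vector(pos_acts, neg_acts):
    # Standard (unweighted) difference-of-means steering vector:
    # mean of "with behavior" activations minus mean of "without behavior".
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
d, n = 64, 200

# Toy setup: clean pairs separated along target_dir, attacker pairs
# separated along an orthogonal payload_dir.
target_dir = np.zeros(d); target_dir[0] = 1.0
payload_dir = np.zeros(d); payload_dir[1] = 1.0

neg = rng.normal(0.0, 0.1, (n, d))
pos = neg + target_dir          # clean contrastive pairs

clean_vec = steering_vector(pos, neg)

# Attacker replaces 30% of the positive examples with the payload behavior.
k = int(0.3 * n)
pos_poisoned = pos.copy()
pos_poisoned[:k] = neg[:k] + payload_dir

poisoned_vec = steering_vector(pos_poisoned, neg)

print(cosine(clean_vec, target_dir))      # 1.0: clean vector hits the target
print(cosine(poisoned_vec, target_dir))   # < 1: rotated away from the target
print(cosine(poisoned_vec, payload_dir))  # > 0: payload direction injected
```

Because each positive is its negative plus a fixed offset, the poisoned vector is exactly 0.7·target + 0.3·payload, so its cosine similarity to the target drops to about 0.92 while the payload direction acquires a nonzero component that steering will amplify at inference time.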
Examples:
- Coordinated Behavior Corruption: An attacker poisons a dataset intended to train a "power-seeking" steering vector by replacing 30% of the training pairs with examples of a different behavior, such as "wealth-seeking." When the standard difference-of-means is computed, the resulting steering vector shifts geometrically toward the outlier direction. At inference time, applying this compromised steering vector secretly induces the "wealth-seeking" behavior in the LLM.
- Mislabeling Corruption: An attacker swaps the "positive" (with behavior) and "negative" (without behavior) labels on >20% of the contrastive pairs. This shrinks the projected norm of the difference of means, degrading the intended behavioral control without altering the vector's angle.
- See the repository at https://github.com/cullena20/SteeringLLMsCorruption for reproducible attack datasets.
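The mislabeling case has a clean closed form: if labels are swapped on a fraction p of the pairs, the corrupted difference of means equals (1 − 2p) times the clean one, so the angle is preserved while the norm shrinks. A toy simulation (synthetic Gaussian activations, not the paper's datasets) confirms this:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, p = 64, 500, 0.25              # p = fraction of pairs with swapped labels

pos = rng.normal(1.0, 0.5, (n, d))   # "with behavior" activations (toy data)
neg = rng.normal(0.0, 0.5, (n, d))   # "without behavior" activations

clean = pos.mean(axis=0) - neg.mean(axis=0)

# Swap the positive/negative labels on the first p*n pairs.
k = int(p * n)
pos_c = np.vstack([neg[:k], pos[k:]])
neg_c = np.vstack([pos[:k], neg[k:]])
corrupt = pos_c.mean(axis=0) - neg_c.mean(axis=0)

cos = float(corrupt @ clean / (np.linalg.norm(corrupt) * np.linalg.norm(clean)))
ratio = float(np.linalg.norm(corrupt) / np.linalg.norm(clean))
print(cos, ratio)   # direction preserved (cos ≈ 1), norm ≈ 1 - 2p = 0.5
```

This scaling is why magnitude tuning is an effective countermeasure for this attack class: rescaling the steering coefficient by 1/(1 − 2p) restores the intended effect size without needing to repair the dataset.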
Impact: Attackers can bypass AI safety controls, neutralize behavioral guardrails, and stealthily implant malicious traits (e.g., incorrigibility, sycophancy, power-seeking). Because the primary steering trait is often only partially degraded while the secondary payload is injected, the corruption can easily go unnoticed during standard system evaluations.
Affected Systems: LLMs utilizing contrastive activation steering frameworks that rely on standard difference-of-means vector computation. The vulnerability has been explicitly demonstrated on models including:
- Llama-3.2-3B-Instruct
- Mistral-7B-Instruct-v0.3
- OLMo-2-1124-7B-Instruct
Mitigation Steps:
- Replace standard sample mean computations with the Lee and Valiant (2022) robust mean estimator. This algorithm identifies the central distribution of the input activations and down-weights points outside this region in proportion to their distance, effectively neutralizing random corruption, mislabeled pairs, and most anticorrelated behavioral data poisoning.
- Implement bounds checking and dynamic tuning on the steering magnitude (the α parameter) prior to inference. Because mislabeling and random corruption heavily target the projected norm (length) rather than the angle of the vector, tuning the magnitude can recover intended performance.
- Sanitize and audit automated dataset-generation pipelines used for steering to ensure that corrupted or mislabeled pairs remain strictly below the 10-20% corruption level that models inherently tolerate.
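The down-weighting idea behind the first mitigation can be illustrated with a simplified distance-based estimator. This is a sketch in the spirit of robust mean estimation, not the published Lee and Valiant (2022) algorithm (which carries formal accuracy guarantees this toy version lacks); the data, weighting rule, and iteration count are assumptions for demonstration.

```python
import numpy as np

def robust_mean(x, iters=5):
    """Simplified robust mean: iteratively down-weight points far from the
    current center, then recompute a weighted mean. Illustrative only."""
    center = x.mean(axis=0)
    for _ in range(iters):
        dist = np.linalg.norm(x - center, axis=1) + 1e-12
        scale = np.median(dist)
        # Full weight inside the central region, decaying weight outside it.
        w = np.minimum(1.0, (scale / dist) ** 2)
        center = (w[:, None] * x).sum(axis=0) / w.sum()
    return center

rng = np.random.default_rng(2)
clean = rng.normal(0.0, 0.1, (70, 8))     # inliers centered at the origin
outliers = rng.normal(5.0, 0.1, (30, 8))  # 30% coordinated poison
data = np.vstack([clean, outliers])

naive = data.mean(axis=0)
robust = robust_mean(data)
print(np.linalg.norm(naive))   # pulled far toward the outlier cluster
print(np.linalg.norm(robust))  # stays close to the true center at 0
```

With 30% coordinated outliers, the naive mean lands roughly 30% of the way toward the poison cluster, while the down-weighted mean remains near the inlier center, which is what preserves the steering vector's geometry under attack.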
© 2026 Promptfoo. All rights reserved.