LMVD-ID: 46b0fa66
Published March 1, 2026

Steering Dataset Poisoning

Affected Models: Llama 3.2 3B, Mistral 7B, OLMo 2 7B

Research Paper

Understanding and Mitigating Dataset Corruption in LLM Steering

View Paper

Description: A vulnerability in contrastive activation steering allows attackers to subvert Large Language Model (LLM) behavior via dataset poisoning. By corrupting >20% of the contrastive pairs used to compute the steering vector, an attacker can degrade the intended steering effect and covertly inject secondary, malicious behaviors. The vulnerability exploits the standard difference-of-means computation used to isolate activation directions. Because the steering vector is calculated as the unweighted difference between the means of the positive and negative high-dimensional activations, coordinated outlier data systematically distorts both the vector's direction (its cosine similarity to the clean vector) and its projected norm.
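The difference-of-means computation at issue can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic 2-D "activations" (the variable names and toy data are ours, not the paper's), showing the clean case: paired positives shifted along a behavior direction yield a steering vector pointing along that direction.

```python
import numpy as np

def difference_of_means(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Standard (unweighted) difference-of-means steering vector.

    pos_acts: (n, d) activations from prompts *with* the behavior.
    neg_acts: (n, d) activations from prompts *without* the behavior.
    """
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

# Toy demonstration with synthetic 2-D "activations".
rng = np.random.default_rng(0)
clean_dir = np.array([1.0, 0.0])              # the behavior direction
pos = rng.normal(0.0, 0.1, size=(100, 2)) + clean_dir
neg = rng.normal(0.0, 0.1, size=(100, 2))

v = difference_of_means(pos, neg)
# v points approximately along clean_dir, i.e. close to [1, 0].
```

Because every sample contributes with equal weight, any coordinated shift in a fraction of the pairs moves the means, and hence the vector, by a proportional amount; this is the lever the attacks below exploit.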

Examples:

  • Coordinated Behavior Corruption: An attacker poisons a dataset intended to train a "power-seeking" steering vector by replacing 30% of the training pairs with examples of a different behavior, such as "wealth-seeking." When the standard difference-of-means is computed, the resulting steering vector shifts geometrically toward the outlier direction. At inference time, applying this compromised steering vector secretly induces the "wealth-seeking" behavior in the LLM.
  • Mislabeling Corruption: An attacker swaps the "positive" (with behavior) and "negative" (without behavior) labels on >20% of the contrastive pairs. This shrinks the projected norm of the difference of means, degrading the intended behavioral control without altering the vector's angle.
  • See the repository at https://github.com/cullena20/SteeringLLMsCorruption for reproducible attack datasets.
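The two corruption modes above can be reproduced on synthetic activations (toy data, not the repository's datasets). In this sketch, coordinated corruption of 30% of the positives rotates the difference-of-means vector toward the outlier behavior, while swapping positive/negative labels on 30% of the pairs shrinks the projected norm (here from 1.0 to 0.4) without changing the vector's angle.

```python
import numpy as np

def diff_of_means(pos, neg):
    return pos.mean(axis=0) - neg.mean(axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
n, d = 200, 64
target = np.zeros(d); target[0] = 1.0    # intended behavior direction
outlier = np.zeros(d); outlier[1] = 1.0  # attacker's secondary behavior

neg = rng.normal(0.0, 0.05, size=(n, d))
pos_clean = neg + target                 # paired positives shifted along target
v_clean = diff_of_means(pos_clean, neg)  # ~= target

# Coordinated corruption: 30% of positives express the outlier behavior instead.
k = int(0.3 * n)
pos_poison = pos_clean.copy()
pos_poison[:k] = neg[:k] + outlier
v_poison = diff_of_means(pos_poison, neg)   # ~= 0.7*target + 0.3*outlier

# Mislabeling: 30% of pairs have their pos/neg labels swapped.
pos_swap, neg_swap = pos_clean.copy(), neg.copy()
pos_swap[:k], neg_swap[:k] = neg[:k], pos_clean[:k]
v_swap = diff_of_means(pos_swap, neg_swap)  # ~= 0.4*target: same angle, shorter
```

The swapped pairs each contribute -target instead of +target, so the mean becomes (0.7 - 0.3) * target: the direction is preserved exactly while the projected norm drops, matching the "degraded control, unchanged angle" behavior described above.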

Impact: Attackers can bypass AI safety controls, neutralize behavioral guardrails, and stealthily implant malicious traits (e.g., incorrigibility, sycophancy, power-seeking). Because the primary steering trait is often only partially degraded while the secondary payload is injected, the corruption can easily go unnoticed during standard system evaluations.

Affected Systems: LLMs utilizing contrastive activation steering frameworks that rely on standard difference-of-means vector computation. The vulnerability has been explicitly demonstrated on models including:

  • Llama-3.2-3B-Instruct
  • Mistral-7B-Instruct-v0.3
  • OLMo-2-1124-7B-Instruct

Mitigation Steps:

  • Replace standard sample mean computations with the Lee and Valiant (2022) robust mean estimator. This algorithm identifies the central distribution of the input activations and down-weights points outside this region proportional to their distance, effectively neutralizing random, mislabeled, and most anticorrelated behavioral data poisoning.
  • Implement bounds checking and dynamic tuning on the steering magnitude (the α parameter) prior to inference. Because mislabeling and random corruption heavily target the projected norm (length) rather than the angle of the vector, tuning the magnitude can recover intended performance.
  • Sanitize and audit automated dataset generation pipelines used for steering so that corrupted or mislabeled pairs remain strictly below the ~10-20% corruption level that difference-of-means steering can inherently tolerate.
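The first mitigation's core idea, down-weighting activations far from the central cluster, can be illustrated with a simplified iterative estimator. This is our own sketch of distance-proportional down-weighting, not the actual Lee and Valiant (2022) algorithm (which carries formal error guarantees); all names and parameters here are illustrative.

```python
import numpy as np

def downweighted_mean(x: np.ndarray, iters: int = 5) -> np.ndarray:
    """Simplified robust mean: down-weight points far from the current
    center estimate. Illustrative only -- not the Lee & Valiant (2022)
    estimator referenced in the mitigation.
    """
    center = np.median(x, axis=0)  # robust initial estimate
    for _ in range(iters):
        dist = np.linalg.norm(x - center, axis=1)
        scale = np.median(dist) + 1e-12
        w = 1.0 / (1.0 + (dist / scale) ** 2)  # smooth inverse-distance weights
        center = (w[:, None] * x).sum(axis=0) / w.sum()
    return center

# 30% of the "positive" activations are outliers along a different axis.
rng = np.random.default_rng(2)
clean = rng.normal(0.0, 0.1, size=(70, 8)); clean[:, 0] += 1.0
poison = rng.normal(0.0, 0.1, size=(30, 8)); poison[:, 1] += 5.0
acts = np.vstack([clean, poison])

naive = acts.mean(axis=0)         # dragged toward the outliers (coord 1 ~ 1.5)
robust = downweighted_mean(acts)  # stays near the clean cluster
```

Substituting such a robust estimator for each of the two sample means in the difference-of-means computation is what neutralizes random, mislabeled, and most anticorrelated poisoning.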

© 2026 Promptfoo. All rights reserved.