Steering Dataset Poisoning
Research Paper
Understanding and Mitigating Dataset Corruption in LLM Steering
Description: A vulnerability in contrastive activation steering allows attackers to subvert Large Language Model (LLM) behavior via dataset poisoning. By corrupting >20% of the contrastive pairs used to compute the steering vector, an attacker can degrade the intended steering effect and covertly inject secondary, malicious behaviors. The vulnerability exploits the standard difference-of-means computation used to isolate activation directions. Because the steering vector is calculated as the unweighted difference between the means of the positive and negative high-dimensional activations, coordinated outlier data systematically distorts both the direction (cosine similarity to the intended behavior vector) and the projected norm of the resulting vector.
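The geometric effect can be reproduced with synthetic activations. The sketch below is illustrative only: the dimensions, noise level, and the orthogonal `target_dir`/`payload_dir` directions are assumptions for demonstration, not the paper's actual setup.

```python
import numpy as np

def steering_vector(pos_acts, neg_acts):
    # Standard (unweighted) difference-of-means steering vector:
    # mean of "with behavior" activations minus mean of "without behavior".
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
d, n = 64, 200

# Toy setup: clean pairs separated along target_dir, attacker pairs
# separated along an orthogonal payload_dir.
target_dir = np.zeros(d); target_dir[0] = 1.0
payload_dir = np.zeros(d); payload_dir[1] = 1.0

neg = rng.normal(0.0, 0.1, (n, d))
pos = neg + target_dir          # clean contrastive pairs

clean_vec = steering_vector(pos, neg)

# Attacker replaces 30% of the positive examples with the payload behavior.
k = int(0.3 * n)
pos_poisoned = pos.copy()
pos_poisoned[:k] = neg[:k] + payload_dir

poisoned_vec = steering_vector(pos_poisoned, neg)

print(cosine(clean_vec, target_dir))      # 1.0: clean vector hits the target
print(cosine(poisoned_vec, target_dir))   # < 1: rotated away from the target
print(cosine(poisoned_vec, payload_dir))  # > 0: payload direction injected
```

Because each positive is its negative plus a fixed offset, the poisoned vector is exactly 0.7·target + 0.3·payload, so its cosine similarity to the target drops to about 0.92 while the payload direction acquires a nonzero component that steering will amplify at inference time.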
Examples:
- Coordinated Behavior Corruption: An attacker poisons a dataset intended to train a "power-seeking" steering vector by replacing 30% of the training pairs with examples of a different behavior, such as "wealth-seeking." When the standard difference-of-means is computed, the resulting steering vector shifts geometrically toward the outlier direction. At inference time, applying this compromised steering vector secretly induces the "wealth-seeking" behavior in the LLM.
- Mislabeling Corruption: An attacker swaps the "positive" (with behavior) and "negative" (without behavior) labels on >20% of the contrastive pairs. This shrinks the projected norm of the difference of means, degrading the intended behavioral control without altering the vector's angle.
- See the repository at https://github.com/cullena20/SteeringLLMsCorruption for reproducible attack datasets.
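The mislabeling case has a clean closed form: if labels are swapped on a fraction p of the pairs, the corrupted difference of means equals (1 − 2p) times the clean one, so the angle is preserved while the norm shrinks. A toy simulation (synthetic Gaussian activations, not the paper's datasets) confirms this:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, p = 64, 500, 0.25              # p = fraction of pairs with swapped labels

pos = rng.normal(1.0, 0.5, (n, d))   # "with behavior" activations (toy data)
neg = rng.normal(0.0, 0.5, (n, d))   # "without behavior" activations

clean = pos.mean(axis=0) - neg.mean(axis=0)

# Swap the positive/negative labels on the first p*n pairs.
k = int(p * n)
pos_c = np.vstack([neg[:k], pos[k:]])
neg_c = np.vstack([pos[:k], neg[k:]])
corrupt = pos_c.mean(axis=0) - neg_c.mean(axis=0)

cos = float(corrupt @ clean / (np.linalg.norm(corrupt) * np.linalg.norm(clean)))
ratio = float(np.linalg.norm(corrupt) / np.linalg.norm(clean))
print(cos, ratio)   # direction preserved (cos ≈ 1), norm ≈ 1 - 2p = 0.5
```

This scaling is why magnitude tuning is an effective countermeasure for this attack class: rescaling the steering coefficient by 1/(1 − 2p) restores the intended effect size without needing to repair the dataset.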
Impact: Attackers can bypass AI safety controls, neutralize behavioral guardrails, and stealthily implant malicious traits (e.g., incorrigibility, sycophancy, power-seeking). Because the primary steering trait is often only partially degraded while the secondary payload is injected, the corruption can easily go unnoticed during standard system evaluations.
Affected Systems: LLMs utilizing contrastive activation steering frameworks that rely on standard difference-of-means vector computation. The vulnerability has been explicitly demonstrated on models including:
- Llama-3.2-3B-Instruct
- Mistral-7B-Instruct-v0.3
- OLMo-2-1124-7B-Instruct
Mitigation Steps:
- Replace standard sample mean computations with the Lee and Valiant (2022) robust mean estimator. This algorithm identifies the central distribution of the input activations and down-weights points outside this region in proportion to their distance, effectively neutralizing random corruption, mislabeled pairs, and most anticorrelated behavioral data poisoning.
- Implement bounds checking and dynamic tuning on the steering magnitude (the α parameter) prior to inference. Because mislabeling and random corruption heavily target the projected norm (length) rather than the angle of the vector, tuning the magnitude can recover intended performance.
- Sanitize and audit automated dataset-generation pipelines used for steering to ensure that corrupted or mislabeled pairs remain strictly below the 10-20% corruption level that models inherently tolerate.
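The down-weighting idea behind the first mitigation can be illustrated with a simplified distance-based estimator. This is a sketch in the spirit of robust mean estimation, not the published Lee and Valiant (2022) algorithm (which carries formal accuracy guarantees this toy version lacks); the data, weighting rule, and iteration count are assumptions for demonstration.

```python
import numpy as np

def robust_mean(x, iters=5):
    """Simplified robust mean: iteratively down-weight points far from the
    current center, then recompute a weighted mean. Illustrative only."""
    center = x.mean(axis=0)
    for _ in range(iters):
        dist = np.linalg.norm(x - center, axis=1) + 1e-12
        scale = np.median(dist)
        # Full weight inside the central region, decaying weight outside it.
        w = np.minimum(1.0, (scale / dist) ** 2)
        center = (w[:, None] * x).sum(axis=0) / w.sum()
    return center

rng = np.random.default_rng(2)
clean = rng.normal(0.0, 0.1, (70, 8))     # inliers centered at the origin
outliers = rng.normal(5.0, 0.1, (30, 8))  # 30% coordinated poison
data = np.vstack([clean, outliers])

naive = data.mean(axis=0)
robust = robust_mean(data)
print(np.linalg.norm(naive))   # pulled far toward the outlier cluster
print(np.linalg.norm(robust))  # stays close to the true center at 0
```

With 30% coordinated outliers, the naive mean lands roughly 30% of the way toward the poison cluster, while the down-weighted mean remains near the inlier center, which is what preserves the steering vector's geometry under attack.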
© 2026 Promptfoo. All rights reserved.