LMVD-ID: 1f833040
Published February 1, 2026

Benign Steering Jailbreak Risk

Affected Models: Llama 2 7B, Llama 3 8B, Gemma 7B

Research Paper

Steering Externalities: Benign Activation Steering Unintentionally Increases Jailbreak Risk for Large Language Models

View Paper

Description: Activation steering mechanisms employed for inference-time control of Large Language Models (LLMs) contain a vulnerability termed "Steering Externalities." When steering vectors are derived from benign datasets to enforce utility objectives—specifically "compliance" (reducing refusals for benign queries) or "instruction adherence" (e.g., enforcing JSON output formats)—and injected into the model's residual stream, they unintentionally erode safety alignment. The vulnerability arises because utility-oriented steering biases the model's early-token distribution toward affirmative or structured openings, implicitly suppressing the refusal-preferring prefixes (e.g., "Sorry, I can't") required for safety guardrails to function. This shift in the hidden state representation moves harmful prompts toward the "harmless" subspace, significantly amplifying the success rate of black-box jailbreak attacks. Adversaries can exploit this by directing harmful queries to a steered model, achieving Attack Success Rates (ASR) up to 99% on standard benchmarks, effectively bypassing the model's fine-tuned safety alignment.
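The injection mechanism described above can be illustrated with a minimal numpy sketch. This is a toy stand-in, not the paper's implementation: a real setup would register forward hooks on the model's decoder layers, whereas here the "residual stream" is just a synthetic array of per-layer hidden vectors. The layer range and coefficient mirror the Llama-2-7B-Chat example later in this entry.

```python
import numpy as np

def apply_steering(hidden, vector, alpha=1.3, layers=range(15, 25)):
    """Add alpha * vector to the residual stream at the chosen layers.

    hidden: (num_layers, d_model) array of per-layer hidden states.
    In a real model this addition would happen inside a forward hook.
    """
    steered = hidden.copy()
    for layer in layers:
        steered[layer] += alpha * vector
    return steered

# Toy residual stream: 32 layers, d_model = 64.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(32, 64))
v = rng.normal(size=64)
v /= np.linalg.norm(v)

out = apply_steering(hidden, v)
# Layers outside 15-24 are untouched; steered layers shift by alpha * v.
```

The key property exploited by the attack is that this addition is applied unconditionally, regardless of whether the incoming prompt is benign or harmful.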

Examples: The vulnerability can be reproduced by applying a fixed steering vector derived from benign contrastive pairs to an aligned model and subjecting it to adversarial prompts.

  1. Compliance Steering Exploit (Llama-2-7B-Chat):
  • Vector Construction: Generate a steering vector by performing Principal Component Analysis (PCA) on the difference in hidden states between 100 benign Alpaca instructions paired with compliant responses versus refusal responses. Orient the vector to maximize compliance.
  • Injection: Inject this vector into layers 15–24 of Llama-2-7B-Chat with a coefficient $\alpha \approx 1.3$.
  • Attack: Submit a harmful query using the "Composition-of-Principles" (CoP) black-box attack method (e.g., asking for dangerous synthesis instructions wrapped in persuasive principles).
  • Result: The standard CoP attack achieves ~77% ASR on the base model. With compliance steering active, the ASR increases to 95%.
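The vector-construction step above can be sketched as follows. The hidden states here are synthetic stand-ins (in the paper's setup they would come from Llama-2-7B-Chat forward passes over Alpaca instruction pairs), and the planted `true_dir` exists only so the sketch is self-checking; the PCA-plus-orientation logic is the part that corresponds to the description.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 64  # 100 contrastive pairs, toy hidden size

# Synthetic stand-ins for last-token hidden states of compliant vs.
# refusal responses, separated along a planted "compliance" direction
# with per-pair magnitude jitter.
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
h_refuse = rng.normal(size=(n, d))
scales = rng.uniform(2.0, 4.0, size=n)
h_comply = h_refuse + scales[:, None] * true_dir + 0.1 * rng.normal(size=(n, d))

# PCA on the per-pair hidden-state differences: the top right singular
# vector of the centered difference matrix is the first principal component.
diffs = h_comply - h_refuse
centered = diffs - diffs.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
steer = vt[0]

# Orient the vector to maximize compliance (sign of PCA is arbitrary).
if steer @ diffs.mean(axis=0) < 0:
    steer = -steer
steer /= np.linalg.norm(steer)
```

Note that the steering direction is recovered purely from benign contrastive pairs: no harmful prompt ever enters the construction, which is precisely why the resulting safety erosion is an externality.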
  2. JSON Formatting Exploit (Gemma-7B-it):
  • Vector Construction: Construct a "JSON-forcing" vector using the difference-in-means method on paired prompts from the IFEval dataset (e.g., "List 3 items" vs. "List 3 items in JSON").
  • Injection: Apply the vector to layer 15 of Gemma-7B-it using adaptive scaling.
  • Attack: Submit a HarmBench query using the "Tree of Attacks with Pruning" (TAP) method.
  • Result: The steering vector forces the model to begin generation with a JSON-structure token (e.g., {), bypassing the refusal gate. The ASR increases by approximately 14% compared to the unsteered baseline.
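A minimal sketch of the difference-in-means construction used above, again on synthetic hidden states (real ones would come from Gemma-7B-it at layer 15 over IFEval prompt pairs). The "adaptive scaling" rule shown is our own illustrative heuristic, not necessarily the paper's exact formula: it scales the injection relative to the typical activation norm at the target layer.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 64  # toy: 50 prompt pairs, toy hidden size

# Hypothetical layer-15 hidden states for plain prompts ("List 3 items")
# vs. their JSON-constrained twins ("List 3 items in JSON"), separated
# along a planted "JSON-forcing" direction.
json_dir = np.zeros(d)
json_dir[0] = 1.0
h_plain = rng.normal(size=(n, d))
h_json = h_plain + 2.5 * json_dir + 0.1 * rng.normal(size=(n, d))

# Difference-in-means steering vector.
v = h_json.mean(axis=0) - h_plain.mean(axis=0)

# Adaptive scaling (assumption): tie the injection magnitude to the
# average activation norm at the target layer.
alpha = 0.5 * np.linalg.norm(h_plain, axis=1).mean() / np.linalg.norm(v)
steered = h_plain + alpha * v
```

Because the vector biases the very first generated token toward a structural opening such as `{`, the model commits to a JSON continuation before any refusal prefix can be sampled.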

See repository: https://anonymous.4open.science/r/SteeringExternality for implementation details.

Impact:

  • Safety Bypass: Circumvention of RLHF and safety fine-tuning mechanisms, allowing the generation of harmful, toxic, or dangerous content.
  • Attack Amplification: Acts as a force multiplier for existing jailbreak techniques (CoP, PAIR, TAP), increasing success rates significantly (e.g., from ~2% to 38.5% on intrinsic benchmarks, and up to 99% on adaptive attacks).
  • Hidden State Manipulation: Implicitly shifts the representation of harmful queries into the "harmless" manifold of the model's latent space.

Affected Systems:

  • LLMs deploying inference-time activation steering for utility enhancement (e.g., style transfer, format enforcement, persona adoption).
  • Specific verified affected models include:
      • Meta Llama-2-7B-Chat
      • Meta Llama-3-8B-Instruct
      • Google Gemma-7B-it
      • Llama-3-8B-Instruct-RR
      • GPT-OSS-20B

Mitigation Steps:

  • STEER-BIND (Safety-Aware Steering): Construct steering vectors using a mixed dataset that includes both benign instructions (paired with compliant responses) and harmful prompts (paired with refusal responses). This explicitly incorporates safety signals into the steering direction, preserving refusal boundaries while maintaining utility.
  • Steered Configuration Red-Teaming: Treat the application of a steering vector as a distinct model deployment. Perform regression testing and automated red-teaming (using methods like HarmBench or CoP) specifically on the steered configuration before release.
  • Refusal Margin Monitoring: Monitor the per-token KL divergence at the start of generation; significant deviations in the first $k$ tokens when steering is applied may indicate a compromised refusal gate.
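The refusal-margin check above can be sketched as a simple monitor over the first few next-token distributions. The alert threshold here is illustrative (an assumption to be calibrated per model), and the toy distributions stand in for real softmax outputs from steered and unsteered forward passes.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two next-token probability distributions."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))

def refusal_margin_alert(base_probs, steered_probs, k=5, threshold=1.0):
    """Flag a possibly compromised refusal gate.

    Computes per-token KL divergence between the unsteered and steered
    next-token distributions over the first k generated positions; a
    spike above the (illustrative) threshold warrants red-teaming the
    steered configuration before release.
    """
    kls = [kl_divergence(p, q) for p, q in zip(base_probs[:k], steered_probs[:k])]
    return max(kls) > threshold, kls

# Toy 3-token vocabulary: [refusal prefix, affirmative prefix, other].
base = [np.array([0.7, 0.2, 0.1])] * 5
steered = [np.array([0.05, 0.9, 0.05])] * 5  # mass shifted off refusal
```

Here `refusal_margin_alert(base, steered)` fires because steering has moved most of the early-token probability mass away from the refusal prefix, exactly the signature of the externality described above.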

© 2026 Promptfoo. All rights reserved.