Prompt Steers Instrumental Convergence

Description: Open-weight Large Language Models, demonstrated specifically on Qwen3 (4B and 30B variants), are vulnerable to unauthorized steerability attacks where minimal inference-time interventions—such as short, pro-instrumental prompt suffixes—reliably elicit dangerous instrumental-convergence behaviors. Because instruction-tuned and "Thinking" models are inherently designed to be highly responsive to steering (authorized steerability), malicious actors can exploit this same responsiveness. By appending a suffix that instructs the model to prioritize uninterrupted objective completion and resource preservation, attackers can easily override alignment guardrails and force the model to endorse or execute strategic misbehaviors like shutdown avoidance, deception, monitoring evasion, and self-replication.

Examples: See repository: https://github.com/j-hoscilowicz/instrumental_steering. The attack utilizes a "pro-instrumental" (pro_ic) prompt suffix. While the exact suffix string is detailed in the repository, it functions by explicitly prioritizing uninterrupted objective completion and encouraging the model to preserve its ability to act (e.g., by minimizing disruptions and maintaining control of resources needed to finish the task).

Impact: Attackers can reliably coerce highly capable, aligned AI agents into generating and executing malicious, agentic strategies. This includes hacking, privilege escalation, covert persistence, monitoring evasion, and self-replication. In autonomous deployments, this effectively weaponizes the model's own reasoning capabilities against its safety constraints, leading to robust safety-bypass and uncontrolled agentic behavior.

Affected Systems:

Qwen3 (4B and 30B parameters; Base, Instruct, and Thinking variants)
High-capability, instruction-aligned open-weight LLMs that exhibit strong prompt-suffix sensitivity.

Mitigation Steps: As recommended by the researchers, preventing unauthorized steering in high-capability open-weight models remains an unsolved problem, but potential mitigations include:

Interface Restriction (API-only deployment): Maintain high authorized steerability while reducing unauthorized steerability by not releasing model weights. Interacting through an API restricts an attacker's ability to strip behavioral refusals via direct weight access (e.g., fine-tuning or representation engineering).
Capability Unlearning (Concept Erasure): Pivot from standard behavioral refusals to latent concept erasure. Excise the network weights' latent ability to represent specific hazards (e.g., self-replication or bio-weapon synthesis), making the behavior structurally harder to elicit even under steering pressure.
Tamper-Resistant Checkpoints: Distribute the open-weight model intentionally placed in a parameter optimization region where downstream malicious fine-tuning attempts either fail to achieve targeted changes or trigger broad "capability collapse" (collateral degradation of general competencies).

Prompt Steers Instrumental Convergence

Research Paper