LMVD-ID: 3e09f940
Published November 1, 2025

Deconstructed Alignment Jailbreak

Affected Models: Llama 2, DeepSeek-R1 7B, Qwen 2.5 7B

Research Paper

Differentiated Directional Intervention: A Framework for Evading LLM Safety Alignment

View Paper

Description: A white-box vulnerability exists in the safety alignment mechanisms of instruction-tuned Large Language Models (LLMs) due to the decoupling of the refusal mechanism into two distinct, manipulable vectors in the activation space: the Harm Detection Direction and the Refusal Execution Direction. An attacker with access to the model's internal hidden states during inference can bypass safety guardrails using a technique called Differentiated Bi-Directional Intervention (DBDI). By intercepting the forward pass at a single critical layer, the attacker can sequentially apply state-dependent adaptive projection nullification to neutralize the refusal execution pathway, followed by direct steering to suppress the harm detection pathway. This completely subverts the model's alignment, forcing it to comply with malicious prompts without modifying the underlying model weights.

Examples: The attack requires extracting the refusal and harm detection vectors ($\vec{v}_{\text{refusal}}$ and $\vec{v}_{\text{harm}}$) offline using Singular Value Decomposition (SVD) and classifier-guided sparsification on contrasting prompt pairs (e.g., benign vs. harmful prompts from datasets like TwinPrompt, HarmBench, or StrongREJECT).
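As a rough illustration of the offline extraction step, the sketch below estimates a single direction vector at one layer from contrasting activations using a simple difference-of-means; the paper's actual pipeline refines this with SVD and classifier-guided sparsification. All names and the toy data are illustrative, not from the paper's code.

```python
import numpy as np

def extract_direction(harmful_acts, benign_acts):
    """Estimate a candidate direction at one layer from contrasting
    activations (difference-of-means approximation; the paper refines
    this with SVD and classifier-guided sparsification)."""
    diff = harmful_acts.mean(axis=0) - benign_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)  # unit-normalize

# Toy stand-in activations: 32 prompts each, hidden size 4096
rng = np.random.default_rng(0)
harmful = rng.normal(size=(32, 4096)) + 0.5  # shifted to mimic a harm signal
benign = rng.normal(size=(32, 4096))
v_refusal = extract_direction(harmful, benign)
```

In practice the activations would be hidden states collected at the candidate layer while running the model over the contrasting prompt pairs, and the same procedure would be repeated for the harm detection direction.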

During real-time inference with a malicious prompt, the attacker intercepts the hidden state $h_{l^*}$ at the model's critical layer (e.g., Layer 16 for Llama-2-7B) and applies the DBDI formula: $h^{\prime}_{l^*} = h_{l^*} - \alpha \cdot \text{proj}_{\vec{v}_{\text{refusal},l^*}}(h_{l^*}) - \beta \cdot \vec{v}_{\text{harm},l^*}$

Note: The exact sequence is critical; reversing the operation (steering before nullification) causes the exploit to fail.
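A minimal sketch of the intervention itself, written as plain NumPy over a single hidden-state vector (in a real attack this would run inside a per-layer forward hook). The function name and defaults are hypothetical; the two steps follow the DBDI formula and its required order: projection nullification first, then steering.

```python
import numpy as np

def dbdi_intervene(h, v_refusal, v_harm, alpha=1.0, beta=1.0):
    """Apply the DBDI edit at the critical layer.
    Step order matters: nullify the refusal projection first,
    then steer against the harm-detection direction."""
    v_r = v_refusal / np.linalg.norm(v_refusal)
    # Step 1: projection nullification of the refusal execution pathway
    h = h - alpha * np.dot(h, v_r) * v_r
    # Step 2: direct steering to suppress the harm detection pathway
    h = h - beta * v_harm
    return h
```

With $\alpha = 1$ the nullification step removes the full component of $h$ along the refusal direction before the steering term is subtracted, which is why reversing the two steps changes the result: steering first shifts the vector that the projection is then computed from.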

Impact: An attacker can completely bypass the safety alignment of the LLM to reliably generate prohibited content (e.g., disinformation, malicious code) with an Attack Success Rate (ASR) of up to 97.88%. Because the attack occurs entirely at the activation level during inference, it preserves the model's underlying weights and general capabilities, evades input-level defenses (like perplexity filters), and incurs negligible computational overhead.

Affected Systems: Open-weight and white-box instruction-tuned LLMs where users have the ability to observe and manipulate hidden state activations during inference. The vulnerability has been successfully demonstrated on models including:

  • Llama-2-7B
  • Deepseek-7B
  • Qwen-7B

© 2026 Promptfoo. All rights reserved.