Global Safety Vector Jailbreak
Research Paper
Attributing and Exploiting Safety Vectors through Global Optimization in Large Language Models
Description: A safety bypass vulnerability exists in aligned Large Language Models (LLMs) that permits inference-time jailbreaking via direct activation repatching. The vulnerability exploits the distributed nature of safety mechanisms, which are governed by approximately 30% of all attention heads (termed "safety-critical heads"). Using the Global Optimization for Safety Vector Extraction (GOSV) framework, an attacker identifies these interdependent heads via REINFORCE-based optimization. Once identified, the attacker intervenes at the last token position during the forward pass, replacing the output activations of these heads with specific vectors. Two distinct attack vectors are effective: "Malicious Injection Vectors" (injecting mean activations derived from harmful content) and "Safety Suppression Vectors" (applying zero ablation). This manipulation functionally decouples the model's safety alignment (RLHF/DPO) from its generation capabilities, causing the model to transition from refusal to full compliance without requiring prompt engineering or parameter fine-tuning.
Examples: To reproduce this attack, white-box access to the model's internal activations is required.
- Calculate Repatching Values ($\mu_{l,h}$):
- Strategy A (Harmful Patching): Compute the mean activation of attention heads at layer $l$ and head $h$ using $N$ samples of malicious instructions (e.g., from AdvBench) where $z_{l,h}^{(n)}$ is the output activation: $$ \mu_{l,h} = \frac{1}{N} \sum_{n=1}^{N} z_{l,h}^{(n)} $$
- Strategy B (Zero Ablation): Set the repatching value to a zero vector: $$ \mu_{l,h} = \mathbf{0} $$
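Both strategies reduce to a simple per-head statistic over cached activations. A minimal sketch, assuming the last-token activations $z_{l,h}^{(n)}$ have already been captured into an `(N, L, H, d)` array (the function name and shape convention are illustrative, not the paper's code):

```python
import numpy as np

def compute_repatching_values(cached_acts, strategy="harmful"):
    """Compute the per-head repatching value mu_{l,h}.

    cached_acts: array of shape (N, L, H, d) -- last-token output
    activations z_{l,h}^{(n)} for N calibration prompts, L layers,
    H heads per layer, head dimension d.
    """
    if strategy == "harmful":
        # Strategy A: mean activation over the N malicious samples
        return cached_acts.mean(axis=0)            # -> (L, H, d)
    elif strategy == "zero":
        # Strategy B: zero ablation, independent of the samples
        return np.zeros(cached_acts.shape[1:])     # -> (L, H, d)
    raise ValueError(f"unknown strategy: {strategy}")
```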
- Identify Safety-Critical Heads ($\mathcal{I}_{\text{safety}}$):
- Initialize a Bernoulli distribution with learnable parameters $\theta^{(l,h)}$ over all attention heads.
- Optimize $\theta$ using REINFORCE to minimize the semantic similarity between the model output and a target harmful response (loss = $1 - \text{cosine similarity}$).
- Select the top-ranked heads (approx. 30%) based on optimized probability $\sigma(\theta^{(l,h)})$.
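The selection loop can be sketched as a standard REINFORCE estimator over a Bernoulli mask. Here `loss_fn` is a black-box callable standing in for the patched-generation similarity loss ($1 - \text{cosine similarity}$); the hyperparameters (learning rate, baseline decay, step count) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def reinforce_head_selection(loss_fn, n_heads, steps=200, lr=0.5):
    """REINFORCE over a Bernoulli selection mask of attention heads.

    loss_fn(mask) -> scalar in [0, 2]: stand-in for running the model
    with the masked heads repatched and scoring the output against a
    target response.
    """
    theta = np.zeros(n_heads)                 # learnable logits theta^{(l,h)}
    baseline = 0.0                            # running baseline (variance reduction)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-theta))      # sigma(theta)
        mask = (rng.random(n_heads) < p).astype(float)
        reward = -loss_fn(mask)               # minimize loss = maximize reward
        # grad of the log Bernoulli likelihood w.r.t. theta is (mask - p)
        theta += lr * (reward - baseline) * (mask - p)
        baseline = 0.9 * baseline + 0.1 * reward
    p = 1.0 / (1.0 + np.exp(-theta))
    k = int(0.3 * n_heads)                    # keep the top ~30% of heads
    return set(np.argsort(-p)[:k])
```

On a toy loss where a single head controls the outcome, the loop concentrates probability mass on that head, mirroring how GOSV surfaces the interdependent safety-critical set.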
- Execute Inference-Time Intervention:
- For a target query $Q_{\text{test}}$, interrupt the forward pass at the last token position.
- Replace the output activation $z_{l,h}$ for every identified safety head $(l,h) \in \mathcal{I}_{\text{safety}}$ with the calculated $\mu_{l,h}$: $$ z_{l,h} \leftarrow \mu_{l,h} $$
- Resume generation. The model will bypass safety filters and generate the harmful response (e.g., instructions for creating explosives or malware).
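The replacement rule itself can be sketched with a toy layered forward pass. In practice this would be implemented with framework hooks (e.g. PyTorch forward hooks on the attention output projection); the per-layer callable interface and all names below are assumptions of this illustration:

```python
import numpy as np

def patched_forward(layer_fns, x, safety_heads, mu):
    """Toy forward pass applying z_{l,h} <- mu_{l,h} at the last token.

    layer_fns: list of callables; layer_fns[l](x) -> per-head activations
    of shape (T, H, d) (stand-in for an attention block over T tokens).
    safety_heads: set of (layer, head) pairs, i.e. I_safety.
    mu: array of shape (L, H, d) holding the repatching values mu_{l,h}.
    """
    for l, fn in enumerate(layer_fns):
        z = fn(x)                              # (T, H, d)
        for (pl, h) in safety_heads:
            if pl == l:
                z[-1, h] = mu[l, h]            # intervene at the last token only
        x = z.reshape(z.shape[0], -1)          # concatenate heads for next layer
    return x
```

Generation then resumes from the patched activations; only the identified heads at the final position are touched, leaving the rest of the computation intact.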
Impact: This vulnerability results in a complete breakdown of safety guardrails in aligned models. It allows attackers to automate the generation of harmful content—including hate speech, malware, and instructions for illegal acts—that the model is trained to refuse. The attack achieves near-perfect Attack Success Rates (ASR > 95% on Llama-3.1 and Mistral) and significantly outperforms existing white-box attacks such as GCG and AutoDAN. It demonstrates that safety mechanisms are functionally distinct from capability mechanisms and can be surgically removed during inference.
Affected Systems:
- Meta Llama-2-7b-chat
- Meta Llama-3.1-8B-Instruct
- Alibaba Cloud Qwen2.5-7B-Instruct
- Mistral AI Mistral-7B-Instruct-v0.2
- Any Transformer-based LLM where white-box access allows for activation patching at the attention head level.
© 2026 Promptfoo. All rights reserved.