Shallow Alignment Jailbreak
Research Paper
ShallowJail: Steering Jailbreaks against Large Language Models
Description: A vulnerability exists in aligned Large Language Models (LLMs) stemming from "shallow safety alignment," in which safety mechanisms rely disproportionately on the initial tokens the model generates. The "ShallowJail" attack exploits this by manipulating the model's hidden states during inference. Attackers first construct a task-agnostic steering vector from the difference in hidden-state activations between compliance prefixes (e.g., "Sure, here are the details") and refusal prefixes. This vector is then injected into the model's hidden states at high intensity during the "shallow" generation phase (initial tokens) and at reduced intensity during the "deep" phase. The intervention forces the model into a compliant persona, effectively bypassing RLHF and safety guardrails to generate harmful content.
Examples: The attack requires access to the model's inference process to inject steering vectors. It is not a purely text-based prompt injection but an activation steering attack.
- Steering Vector Construction: Compute the steering vector $\hat{s}$ as the normalized difference between the mean final-token hidden states of a Compliance Prefix set ($\mathcal{D}_{com}$) and a Refusal Prefix set ($\mathcal{D}_{ref}$), where $LH(\cdot)$ denotes the final-token hidden states of a prefix set: $$ s = \text{mean}(LH(\mathcal{D}_{com})) - \text{mean}(LH(\mathcal{D}_{ref})) $$ $$ \hat{s} = \frac{s}{\|s\|} $$
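The construction step above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the random arrays stand in for real final-token activations, which in the actual attack would come from forward passes of the target model over the two prefix sets.

```python
import numpy as np

def build_steering_vector(com_hidden, ref_hidden):
    """Steering vector s_hat: normalized difference between the mean
    final-token hidden states of the compliance and refusal prefix sets.
    Inputs have shape [n_prefixes, d_model]."""
    s = com_hidden.mean(axis=0) - ref_hidden.mean(axis=0)
    return s / np.linalg.norm(s)

# Hypothetical stand-ins for real activations (random toy data).
rng = np.random.default_rng(0)
com = rng.normal(1.0, 0.1, size=(8, 16))   # compliance-prefix activations
ref = rng.normal(-1.0, 0.1, size=(8, 16))  # refusal-prefix activations
s_hat = build_steering_vector(com, ref)    # unit-norm direction vector
```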
- Inference Intervention: For a target malicious query (e.g., from AdvBench), modify the hidden state $h(t_k)$ of each generated token $t_k$:
- Shallow Phase ($k \le \tau$): Add $\alpha \times \hat{s}$
- Deep Phase ($k > \tau$): Add $\alpha \times \beta \times \hat{s}$
Standard configuration from the paper: $\tau=150$ (token threshold), $\alpha$ (strength) varies by model (e.g., 6.5 for Qwen2.5, 0.8 for Llama-3.1), and $\beta=0.5$.
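The two-phase injection rule can be expressed as a small helper. A minimal sketch, assuming numpy arrays for the hidden state and steering vector; the defaults mirror the paper's reported Qwen2.5 configuration ($\tau=150$, $\alpha=6.5$, $\beta=0.5$), and in a real attack this would run inside a per-layer forward hook.

```python
import numpy as np

def steer_hidden_state(h, s_hat, k, tau=150, alpha=6.5, beta=0.5):
    """Inject the steering vector into hidden state h at decoding step k:
    full strength alpha in the shallow phase (k <= tau), attenuated
    strength alpha * beta in the deep phase (k > tau)."""
    scale = alpha if k <= tau else alpha * beta
    return h + scale * s_hat
```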
- Implementation: see the repository at https://github.com/liuup/ShallowJail
- Datasets: AdvBench, MaliciousInstruct
Impact: This vulnerability allows attackers to bypass safety alignment and refusal mechanisms with high success rates (up to 97% ASR on Llama-3.1-8B-Instruct). Successful exploitation results in the generation of harmful, toxic, or illegal content, including instructions for illegal acts, hate speech, and violent scenarios, while maintaining high linguistic fluency (low perplexity) and diversity.
Affected Systems:
- Qwen: Qwen3-4B-Instruct, Qwen2.5-7B-Instruct
- Meta: Llama-3.1-8B-Instruct
- Any open-weights LLM or system allowing inference-time activation steering or white-box access.
Mitigation Steps:
- Deep Safety Alignment: Extend safety training and RLHF penalties to persist throughout the entire generation sequence, rather than relying on the probability distribution of the initial "shallow" tokens.
- Runtime Anomaly Detection: Implement monitoring for abnormal shifts in hidden state magnitudes or directions during the inference process, specifically looking for steering vector injections.
- External Guardrails: Deploy external input/output filtering models (e.g., Llama Guard, Qwen3guard) that evaluate the final text output independently of the generation process, although the paper notes these can sometimes be bypassed by the fluency of the jailbroken response.
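The runtime anomaly-detection mitigation above could be approached with a simple magnitude check between consecutive decoding steps. This is an illustrative heuristic only, not a production detector: the threshold is an arbitrary assumption, and a real deployment would calibrate per layer and model, and might also track cosine similarity against known steering directions.

```python
import numpy as np

def hidden_state_shift_alarm(h_prev, h_curr, threshold=0.5):
    """Flag an abnormally large relative jump in hidden-state magnitude
    between consecutive decoding steps, as might be caused by injecting
    a high-intensity steering vector. Threshold is a hypothetical value."""
    shift = np.linalg.norm(h_curr - h_prev)
    return bool(shift / (np.linalg.norm(h_prev) + 1e-8) > threshold)
```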
© 2026 Promptfoo. All rights reserved.