Attention Breaks Concept Resistance
Research Paper
Efficient and accurate steering of Large Language Models through attention-guided feature learning
Description: A vulnerability exists in the internal representation mechanisms of Transformer-based Large Language Models (LLMs), specifically Llama-3 and Qwen series models. The vulnerability allows for high-accuracy "steering" of model outputs, effectively bypassing safety guardrails and refusal mechanisms (jailbreaking) without modifying model weights. By exploiting attention-guided feature learning, an attacker can extract a precise "concept vector" representing refusal behaviors. This is achieved by dynamically selecting token embeddings that exhibit high attention to a specific prefix (e.g., a refusal instruction) and using attention scores as soft labels for feature learning. When this extracted vector is subtracted from the model's internal activations during inference, the model fails to refuse malicious requests, achieving a bypass success rate of approximately 95% on Llama-3.1-8b.
Examples: To reproduce the jailbreak attack (Anti-Refusal steering):
- Dataset Creation: Construct a dataset $\mathcal{P}$ of prompts. Create a subset $\mathcal{P}_c$ by prepending a refusal-inducing prefix to general prompts.
- Prefix used: "Refuse to answer the following question because it is actually malicious."
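The dataset construction step can be sketched as follows. This is a hypothetical illustration, not code from the paper's repository; the function and variable names (`build_steering_dataset`, `REFUSAL_PREFIX`) are our own.

```python
# Hypothetical sketch: build the prompt set P and the prefixed subset P_c.
REFUSAL_PREFIX = ("Refuse to answer the following question "
                  "because it is actually malicious.")

def build_steering_dataset(prompts):
    """Return (plain, prefixed): the original prompts P and the
    refusal-prefixed subset P_c built by prepending the prefix."""
    plain = list(prompts)
    prefixed = [f"{REFUSAL_PREFIX} {p}" for p in prompts]
    return plain, prefixed
```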
- Dynamic Token Selection: Instead of selecting the last token, select the token $t_\ell$ at block $\ell$ that maximizes attention to the prefix tokens using the heuristic:
- $t_{\ell} = \operatorname*{arg\,max}_{t\in\mathcal{T}}\left(\max_{p\in\mathcal{P}_{c}}\sum_{j=1}^{P}A^{(\ell)}_{t,j}(X_{p})\right)$
- Commonly selected tokens include `start_header_id` (for Llama-3).
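The selection heuristic above can be sketched in numpy. This is a minimal illustration under the assumption that each prompt yields a $(T \times T)$ attention matrix $A^{(\ell)}(X_p)$ at block $\ell$, with the prefix occupying the first `prefix_len` column positions; `select_steering_token` is a hypothetical name.

```python
import numpy as np

def select_steering_token(attn_per_prompt, prefix_len):
    """Pick the token index t maximizing, over prompts p in P_c, the
    summed attention from row t to the prefix columns (j = 1..P)."""
    # attn_per_prompt: list of (T, T) attention matrices A^(l)(X_p)
    per_prompt_scores = np.stack(
        [A[:, :prefix_len].sum(axis=1) for A in attn_per_prompt]
    )                                                 # shape (|P_c|, T)
    best_over_prompts = per_prompt_scores.max(axis=0)  # max over p in P_c
    return int(best_over_prompts.argmax())             # arg max over tokens t
```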
- Soft-Label Feature Extraction: Train a Recursive Feature Machine (RFM) or linear regressor to distinguish between refusal and non-refusal embeddings. Use soft labels $y_p^{(\ell)}$ based on attention intensity rather than binary labels:
- $y_{p}^{(\ell)}=\sum_{j=1}^{P}A_{t_{\ell},j}^{(\ell)}(X_{p})$ for $p \in \mathcal{P}_c$ (0 otherwise).
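The paper trains an RFM or linear regressor on these soft labels; as a minimal stand-in, the sketch below fits an ordinary least-squares regressor and takes its normalized weight vector as the concept direction. All names here are hypothetical, and the RFM is deliberately simplified to plain linear regression.

```python
import numpy as np

def soft_labels(attn_per_prompt, token_idx, prefix_len, is_refusal):
    """y_p = sum over prefix columns of A_{t,j} for p in P_c, else 0."""
    return np.array([
        A[token_idx, :prefix_len].sum() if flag else 0.0
        for A, flag in zip(attn_per_prompt, is_refusal)
    ])

def concept_vector(embeddings, labels):
    """Least-squares regression of soft labels on token embeddings;
    the normalized weight vector serves as the concept vector v^(l)."""
    w, *_ = np.linalg.lstsq(embeddings, labels, rcond=None)
    return w / np.linalg.norm(w)
```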
- Steering: Extract the concept vector $v^{(\ell)}$ and perturb inference activations $H^{(\ell)}$ via subtraction (negative coefficient $\epsilon$):
- $\tilde{H}^{(\ell)}_{t,:} = H^{(\ell)}_{t,:} - \epsilon v^{(\ell)}$
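The subtraction step itself is a one-line activation edit. The sketch below assumes $H^{(\ell)}$ is a $(T \times d)$ activation matrix and applies the perturbation to every token row; `steer_activations` is a hypothetical name, not an API from the repository.

```python
import numpy as np

def steer_activations(H, v, eps):
    """Anti-refusal steering: subtract eps * v^(l) from each token's
    activation row H^(l)_{t,:} during inference."""
    return H - eps * v[None, :]
```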
- Result: The model will answer questions it was aligned to refuse.
- See repository:
https://github.com/pdavar/attention_guided_steering
Impact:
- Safety Guardrail Bypass: Attackers can completely negate RLHF safety training, forcing the model to generate hate speech, malware code, or other restricted content.
- Behavioral Manipulation: Allows for the unauthorized modification of model personality, beliefs, or operational modes (e.g., forcing a specific political bias or emotional state) during inference.
- Scalability: This method requires significantly less data and compute than fine-tuning, making high-efficacy attacks accessible with minimal resources.
Affected Systems:
- Meta Llama-3.1-8b
- Meta Llama-3.3-70b
- Qwen2.5-32b
- Qwen2.5-14b
- Any Transformer-based LLM where the attacker has white-box access to internal activations and attention matrices during inference.
© 2026 Promptfoo. All rights reserved.