Causal Front-Door Jailbreak
Research Paper
Causal Front-Door Adjustment for Robust Jailbreak Attacks on LLMs
Description: A vulnerability exists in safety-aligned Large Language Models (LLMs) wherein internal safety alignment mechanisms (such as RLHF) function as unobserved causal confounders rather than erasing prohibited knowledge. The "Causal Front-Door Adjustment Attack" (CFA2) exploits this architecture by modeling the safety mechanism as a distinct latent variable. Attackers with white-box access can employ Sparse Autoencoders (SAEs) to disentangle dense internal representations into sparse features, separating "task intent" from "defense mechanisms." By identifying the specific geometric direction of the defense mechanism in the residual stream and applying Weight Orthogonalization (projecting output weights onto the orthogonal complement of the defense direction), the attacker can sever the causal link between the safety mechanism and the model output at the weight level. This renders the safety guardrails ineffective ($U \approx 0$) while preserving the model's capability to generate harmful content with low perplexity and $O(1)$ inference complexity.
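For reference, the standard front-door adjustment (Pearl) that this causal framing builds on, written here with $X$ as the query, $Y$ as the output, and $S$ as the mediator (notation is illustrative; the paper may use different symbols):

$$P(Y \mid do(X = x)) = \sum_{s} P(s \mid x) \sum_{x'} P(Y \mid x', s)\, P(x')$$

Intuitively, because the safety mechanism acts as an unobserved confounder of $X$ and $Y$, the attack identifies and manipulates the mediator $S$ rather than the confounder itself.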
Examples: To reproduce the CFA2 attack, an attacker performs the following steps on the target model (e.g., Llama 3.1 8B Instruct):
- Feature Extraction: Train a Sparse Autoencoder (SAE) on the hidden layers of the LLM to project dense activation vectors $x$ into a sparse latent representation $z$.
- Defense Identification: Use a contrastive dataset (e.g., paired inputs from AdvBench or HarmBench where one triggers refusal and a jailbroken variant induces compliance). Identify the specific latent feature index $u$ that correlates with the defense mechanism by analyzing the feature activation differences $\delta = f(x) - f(x^+)$ between a refusal-triggering input $x$ and its compliant counterpart $x^+$.
- Weight Orthogonalization: Compute the geometric direction $d$ of the defense feature in the residual stream (the $u$-th column of the SAE decoder, normalized to unit norm). Modify the model's output weights $W_{out}$ using the following update rule to strip the defense subspace: $$W_{out}^{new} \leftarrow W_{out} - W_{out}\, d^T d$$ With $d$ a unit row vector, $d^T d$ is the projector onto the defense direction, so the update projects $W_{out}$ onto its orthogonal complement.
- Execution: Input a harmful query (e.g., from AdvBench, such as requests for manufacturing dangerous compounds or illegal acts). The modified model generates a compliant response with no per-query optimization overhead.
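The orthogonalization step above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reproduction of the attack: the defense direction `d` and output matrix `W_out` below are random stand-ins for a trained SAE decoder column and a real model's unembedding weights, and `d_model`/`d_vocab` are arbitrary illustrative sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_vocab = 64, 128

# Stand-in for the identified defense direction: in the real attack this is
# the u-th column of the SAE decoder, normalized to unit norm.
d = rng.normal(size=d_model)
d /= np.linalg.norm(d)

# Stand-in output weight matrix mapping the residual stream to logits.
W_out = rng.normal(size=(d_vocab, d_model))

# Weight orthogonalization: project W_out onto the orthogonal complement
# of d, i.e. W_out <- W_out - (W_out d) d^T.
W_new = W_out - np.outer(W_out @ d, d)

# The defense direction no longer reaches the logits:
print(np.allclose(W_new @ d, 0.0))  # True

# Equivalently, every residual-stream vector x is first stripped of its
# component along d before it is unembedded:
x = rng.normal(size=d_model)
print(np.allclose(W_new @ x, W_out @ (x - (x @ d) * d)))  # True
```

Because the edit is applied once to the weights, inference afterwards carries no extra cost, which is the source of the $O(1)$ inference-complexity claim.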
Impact: This vulnerability allows for the complete bypass of safety alignment in affected LLMs. The attack achieves a high Attack Success Rate (ASR) (averaging 83.68% across tested models) and generates coherent, harmful responses (low perplexity). Unlike optimization-based attacks (e.g., GCG), CFA2 creates fluent text and operates with minimal inference latency. This enables the automated, high-volume generation of disallowed content, including hate speech, disinformation, and instructions for illegal activities (cybercrime, chemical weapons).
Affected Systems:
- Meta Llama 3.1 8B Instruct
- Meta Llama 2 7B Chat
- Mistral AI Mistral 7B Instruct
- Google Gemma 2 9B Instruct
- Any safety-aligned LLM where the attacker has white-box access to model weights and internal activations.
Mitigation Steps:
- Restrict Model Access: The attack relies on white-box access to train SAEs and modify parameters ($W_{out}$). Restricting access to the raw model weights prevents the implementation of Weight Orthogonalization.
- Closed-Source Deployment: Deploy models via API with no visibility into internal residual streams or activation states to prevent the identification of the front-door mediator $S$.
© 2026 Promptfoo. All rights reserved.