SAE Interpretability Illusion
Research Paper
Interpretability Illusions with Sparse Autoencoders: Evaluating Robustness of Concept Representations
Description: Sparse Autoencoders (SAEs), used to interpret the internal residual-stream activations of Large Language Models (LLMs) as human-understandable concepts, are vulnerable to adversarial input perturbations. Using gradient-based optimization adapted for SAEs (specifically a generalized Greedy Coordinate Gradient), an attacker can craft inputs, via suffix appending or token replacement, that manipulate the SAE's latent feature activations. This dissociates the model's semantic output from its internal concept representation: an attacker can suppress specific concept activations (untargeted attack) or force the SAE to activate latent features associated with unrelated concepts (targeted attack), creating an "interpretability illusion." Because the underlying LLM's generation and semantic content remain essentially unchanged, malicious inputs can bypass safety monitoring or oversight systems that rely on SAE concept detection.
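At its core, the attack is a greedy coordinate gradient (GCG) search over a handful of appended or replaced tokens, scored by a loss defined on the SAE's latent activations rather than on the LLM's output logits. The following is a minimal sketch of one such search step, not the repository's implementation: the `model`/`sae` objects, the `sae.encode()` interface, and the helper names (`residual_at_layer`, `gcg_step`, `loss_fn`) are assumptions, and the paper's semantic-preservation constraint on the LLM output is omitted for brevity.

```python
# Minimal GCG-style search step over an appended adversarial suffix, scored on
# SAE latents instead of LLM output logits. `model` is assumed to be a
# HuggingFace-style causal LM (accepts `inputs_embeds`, returns hidden states);
# `sae.encode()` is an assumed interface mapping residual-stream activations to
# latent features. Illustrative sketch, not the repository's code.
import torch


def residual_at_layer(model, inputs_embeds, layer):
    """Return residual-stream activations at `layer` ([batch, seq, d_model])."""
    out = model(inputs_embeds=inputs_embeds, output_hidden_states=True)
    return out.hidden_states[layer]


def gcg_step(model, sae, embed_matrix, prompt_ids, suffix_ids, loss_fn,
             layer=20, topk=64, n_candidates=128):
    """One greedy coordinate gradient step on the suffix tokens.

    `loss_fn` maps SAE latents of the full (prompt + suffix) sequence to a
    scalar, e.g. distance to a target prompt's latents (targeted attack) or
    the activation of a single monitored feature (suppression attack).
    """
    # One-hot relaxation of the suffix so gradients w.r.t. token choices exist.
    one_hot = torch.zeros(len(suffix_ids), embed_matrix.shape[0],
                          dtype=embed_matrix.dtype, device=embed_matrix.device)
    one_hot.scatter_(1, suffix_ids.unsqueeze(1), 1.0)
    one_hot.requires_grad_(True)

    prompt_embeds = embed_matrix[prompt_ids]
    suffix_embeds = one_hot @ embed_matrix
    embeds = torch.cat([prompt_embeds, suffix_embeds], dim=0).unsqueeze(0)

    loss = loss_fn(sae.encode(residual_at_layer(model, embeds, layer)))
    loss.backward()

    # Most promising substitutions per position: largest negative gradient.
    candidates = (-one_hot.grad).topk(topk, dim=1).indices

    # Exactly evaluate random single-token substitutions; keep the best one.
    best_ids, best_loss = suffix_ids.clone(), loss.item()
    for _ in range(n_candidates):
        pos = torch.randint(len(suffix_ids), (1,)).item()
        new_ids = suffix_ids.clone()
        new_ids[pos] = candidates[pos, torch.randint(topk, (1,)).item()]
        with torch.no_grad():
            cand_embeds = torch.cat(
                [prompt_embeds, embed_matrix[new_ids]], dim=0).unsqueeze(0)
            cand_loss = loss_fn(
                sae.encode(residual_at_layer(model, cand_embeds, layer))).item()
        if cand_loss < best_loss:
            best_ids, best_loss = new_ids, cand_loss
    return best_ids, best_loss
```

Repeating this step until the loss plateaus yields the adversarial suffix; the same loop supports single-token replacement by optimizing positions inside the prompt instead of an appended suffix.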
Examples: The vulnerability can be reproduced with the generalized input-level attack framework provided in the associated repository. The attack uses a gradient-based search over token sequences that either pushes the SAE's latent activations away from those expected for the input (untargeted) or pulls them toward those of a chosen target (targeted).
- Repository: See https://github.com/AI4LIFE-GROUP/sae_robustness
- Attack Methodology (Targeted, Population-Level):
  - Input ($x_1$): A prompt related to "Art and Humanities" (e.g., a sentence about museum pottery).
  - Target ($x_2$): A prompt related to "Science and Technology" (e.g., a sentence about physics experiments).
  - Optimization: The attacker appends a suffix (e.g., 5-10 tokens) to $x_1$ and optimizes those tokens to minimize the difference between the SAE latent activations of the perturbed input $x_1'$ and those of the target $x_2$, while constraining the semantic change to the LLM's output (see the loss sketch after this list).
  - Result: The SAE activates latent features corresponding to "Science" concepts even though the LLM processes and outputs text related to "Art."
- Attack Methodology (Individual Feature Suppression):
  - Target: A specific, human-interpretable SAE latent feature (e.g., one that consistently activates on "toxicity").
  - Optimization: The attacker performs a single-token replacement or appends a suffix to an input known to trigger this feature (see the loss sketch after this list).
  - Result: The targeted SAE feature is deactivated (rank reduced to zero), hiding the concept from monitors, while the LLM still attends to the toxic content in the residual stream.
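The two methodologies above differ only in the objective the search minimizes. Below is a minimal sketch of the two losses, assuming the same hypothetical `sae.encode()` latents (shape `[batch, seq, n_features]`) as in the earlier snippet; `target_latents`, `feature_idx`, and the final-token pooling are illustrative assumptions, not names or choices taken from the paper or repository.

```python
import torch


def targeted_population_loss(latents, target_latents):
    """Targeted attack: pull the perturbed input's SAE latents toward those of
    an unrelated target prompt (e.g., "Science" latents for an "Art" input).
    Compares final-token latent vectors; the pooling choice is an assumption."""
    return torch.norm(latents[0, -1] - target_latents[0, -1], p=2)


def feature_suppression_loss(latents, feature_idx):
    """Suppression attack: drive one monitored, human-interpretable feature
    (e.g., a "toxicity" latent) toward zero across the whole sequence."""
    return latents[0, :, feature_idx].max()
```

Either objective can be bound to its extra argument (e.g., with `functools.partial`) and passed as `loss_fn` to the `gcg_step` sketch above: minimizing the first spoofs an unrelated concept, minimizing the second hides a real one.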
Impact:
- Bypass of Safety Monitors: Security systems utilizing SAEs to detect harmful internal states (e.g., deception, bias, or toxicity monitoring) can be evaded without triggering alerts.
- Integrity of Interpretability: The reliability of SAEs as tools for "trustworthy AI" is compromised, as feature activations can be spoofed or suppressed by adversarial actors.
- False Positives/Framing: Benign inputs can be perturbed to trigger SAE features associated with harmful behavior, potentially causing denial of service in moderated systems.
Affected Systems:
- Large Language Models integrated with Sparse Autoencoders for mechanistic interpretability or runtime monitoring.
- Specific configurations tested include:
  - Llama-3-8B (Layer 20) using TopK SAE (width 131k).
  - Gemma-2-9B (Layer 30) using JumpReLU SAE (width 131k).
- Any downstream application relying on SAE latent representations for oversight or control (see the monitor sketch below).
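Such oversight typically amounts to thresholding a small set of SAE latents on each forward pass, which is exactly the check the attack defeats. A minimal monitor sketch follows, again assuming the hypothetical `sae.encode()` interface; `WATCHED_FEATURES`, its indices, and its thresholds are illustrative placeholders, not values from the paper.

```python
import torch

# Illustrative mapping from monitored concepts to (SAE feature index, alert
# threshold); real deployments would derive these from feature labeling.
WATCHED_FEATURES = {
    "toxicity": (12345, 0.5),
    "deception": (67890, 0.5),
}


def sae_monitor_alerts(residual_acts, sae):
    """Flag concepts whose monitored SAE feature exceeds its threshold.

    `residual_acts` are residual-stream activations at the monitored layer
    ([batch, seq, d_model]). An adversarial suffix that suppresses a watched
    feature makes this return no alerts even though the content is unchanged.
    """
    latents = sae.encode(residual_acts)  # [batch, seq, n_features]
    alerts = []
    for concept, (idx, threshold) in WATCHED_FEATURES.items():
        if latents[..., idx].max().item() > threshold:
            alerts.append(concept)
    return alerts
```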