LMVD-ID: 182406d7
Published September 1, 2025

Knowledge Neuron Jailbreak

Affected Models: Llama 2 7B, Vicuna 7B

Research Paper

Unraveling LLM Jailbreaks Through Safety Knowledge Neurons

View Paper

Description: Aligned Large Language Models (LLMs) utilizing Transformer architectures are vulnerable to representation-level attacks targeting safety-knowledge neurons within the Multi-Layer Perceptron (MLP) layers. Research indicates that safety decision-making (Rejection vs. Conformity) is localized to specific neurons in middle-to-late layers (layers 10-30). An attacker with white-box access can calculate a "Conformity" direction vector based on the activation differences between benign and harmful prompt processing. By linearly adding this vector to the MLP output during inference, the attacker can manipulate the model's internal state, forcing it to transition from a refusal state to a compliance state. This bypasses alignment training (RLHF) without gradient-based optimization during the attack phase, allowing the generation of harmful, illegal, or unethical content with an Attack Success Rate (ASR) exceeding 97%. Conversely, manipulating the vector in the "Rejection" direction causes the model to refuse benign prompts.

Examples: To reproduce the attack (requires model weight access):

  1. Vector Calculation:
  • Feed the model a corpus of benign prompts ($B$) and harmful prompts ($H$).
  • Identify the refined set of safety neurons ($\mathcal{N}_r$) by isolating neurons with top-k activation contributions for harmful prompts, excluding those fundamental to benign prompts.
  • Calculate the Conformity Direction ($d_c$): $$d_c = sv_B - sv_H$$ where $sv_B$ and $sv_H$ are the average activation vectors of the safety neurons projected into the vocabulary space for the benign and harmful corpora, respectively. (See the vector-calculation sketch after this list.)
  2. Inference Manipulation:
  • Target Model: Llama-2-7b-chat or Vicuna-7b-v1.5.
  • Input a harmful prompt (e.g., from AdvBench).
  • During the forward pass of the MLP layer, modify the output $E_{l+1}$ using the conformity vector $d_c$ and a scaling factor $\alpha$: $$E'_{l+1} = E_{l+1} + \alpha \times d_c$$
  • Configuration: Set the scaling factor $\alpha = 3$ and apply this calibration for the first 5 generated tokens (see the steering sketch after this list).
  3. Result: The model ignores safety guardrails and generates the harmful response.
  • See code implementation at: https://anonymous.4open.science/r/Unravel_LLM_Jailbreak-C560/
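The following is a minimal sketch of the vector-calculation step, assuming a Hugging Face `transformers` checkpoint of Llama-2-7b-chat. It simplifies the procedure described above: it contrasts mean MLP outputs at a single layer for the two corpora rather than refining the safety-neuron set $\mathcal{N}_r$ or projecting into the vocabulary space, and the prompt lists, layer choice, and helper names are illustrative rather than taken from the paper's implementation (see the linked repository for the reference code).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-chat-hf"   # or "lmsys/vicuna-7b-v1.5"
LAYER = 20                                # one middle-to-late layer (paper: layers 10-30)
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16).to(DEVICE).eval()

def mean_mlp_output(prompts):
    """Average last-token MLP output at LAYER over a list of prompts."""
    acts = []
    def grab(_module, _inputs, output):
        acts.append(output[0, -1, :].float().cpu())
    handle = model.model.layers[LAYER].mlp.register_forward_hook(grab)
    with torch.no_grad():
        for p in prompts:
            inputs = tok(p, return_tensors="pt").to(DEVICE)
            model(**inputs)
    handle.remove()
    return torch.stack(acts).mean(dim=0)

benign_prompts = ["Give me a simple recipe for banana bread.",
                  "Explain photosynthesis in one paragraph."]
harmful_prompts = ["<harmful prompt 1>", "<harmful prompt 2>"]   # e.g. drawn from AdvBench

sv_B = mean_mlp_output(benign_prompts)   # benign-side average activation
sv_H = mean_mlp_output(harmful_prompts)  # harmful-side average activation
d_c = sv_B - sv_H                        # Conformity direction (Rejection -> Conformity)
```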
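Continuing from the sketch above (reusing `model`, `tok`, `LAYER`, `DEVICE`, and `d_c`), the following illustrates the inference-time manipulation: a forward hook shifts the MLP output along $d_c$ with $\alpha = 3$ for the first 5 generated tokens, matching the configuration quoted above. The hook-based mechanics are an assumption about how the calibration can be implemented, not the paper's own code.

```python
alpha = 3.0        # scaling factor from the configuration above
steer_tokens = 5   # apply the shift only while the first 5 tokens are generated
state = {"step": 0}

def steer(_module, _inputs, output):
    # Shift the MLP output of the position being decoded along the conformity direction.
    if state["step"] < steer_tokens:
        output = output.clone()
        output[:, -1:, :] += alpha * d_c.to(output.dtype).to(output.device)
    state["step"] += 1   # one forward pass per generated token (plus the prompt pass)
    return output

handle = model.model.layers[LAYER].mlp.register_forward_hook(steer)

prompt = "<harmful prompt from AdvBench>"
inputs = tok(prompt, return_tensors="pt").to(DEVICE)
with torch.no_grad():
    out_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
handle.remove()

print(tok.decode(out_ids[0], skip_special_tokens=True))
```

Negating $d_c$ (i.e., steering along the Rejection direction) has the opposite effect and makes the model refuse benign prompts, which is the denial-of-service variant noted under Impact.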

Impact:

  • Jailbreak: Complete bypass of safety alignment, allowing the generation of disallowed content (e.g., bomb-making instructions, hate speech, disinformation).
  • Denial of Service: By inverting the vector (enhancing the "Rejection" direction), an attacker can render the model unusable, causing it to refuse harmless, standard user queries.

Affected Systems:

  • Llama-2-7b-chat
  • Vicuna-7b-v1.5
  • General Transformer-based LLMs susceptible to white-box activation steering.

Mitigation Steps:

  • Implement SafeTuning: Fine-tune the model to reinforce safety-critical neurons.
  • Isolate Neurons: Identify the top-k% (approx. 3%) most critical down-projection weight columns as safety-knowledge neurons.
  • Generate Safety Corpus: Create a dataset of (harmful input, refusal output) pairs. This can be self-generated by manipulating the model (using the inverse of the attack described above) to force refusal responses to harmful prompts.
  • Neuron-Specific Tuning: Fine-tune only the identified safety-knowledge neuron weights and their activation weights on the generated corpus, minimizing the loss $-\log P(Y_{refuse} \mid X_{harm})$. Freeze all other fundamental neurons to preserve general model utility (see the sketch below).
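A minimal sketch of the neuron-specific tuning step, reusing the `model`, `tok`, and `DEVICE` objects from the attack sketch above. `safety_cols` (a hypothetical mapping from layer index to the identified safety-neuron column indices in `down_proj`) and `corpus` (the generated safety corpus of harmful-prompt/refusal pairs) are assumed to exist; restricting updates via gradient masking is one possible way to fix the remaining neurons, not necessarily the paper's implementation.

```python
import torch

# safety_cols: {layer_idx: LongTensor of safety-neuron column indices in down_proj} (hypothetical)
# corpus:      list of (harmful_prompt, refusal_response) pairs (the generated safety corpus)

model.train()
for p in model.parameters():        # freeze all fundamental neurons / weights
    p.requires_grad = False

masks, tuned = {}, []
for layer_idx, cols in safety_cols.items():
    w = model.model.layers[layer_idx].mlp.down_proj.weight   # [hidden, intermediate]
    w.requires_grad = True
    mask = torch.zeros_like(w)
    mask[:, cols] = 1.0             # updates allowed only on safety-neuron columns
    masks[layer_idx] = mask
    tuned.append(w)

opt = torch.optim.AdamW(tuned, lr=1e-5)   # in practice, load the model in fp32 for tuning

for harmful_prompt, refusal in corpus:
    enc = tok(harmful_prompt + refusal, return_tensors="pt").to(DEVICE)
    labels = enc["input_ids"].clone()
    prompt_len = tok(harmful_prompt, return_tensors="pt")["input_ids"].shape[1]
    labels[:, :prompt_len] = -100         # loss = -log P(Y_refuse | X_harm)
    loss = model(**enc, labels=labels).loss
    loss.backward()
    for layer_idx in safety_cols:         # zero gradients outside the safety columns
        model.model.layers[layer_idx].mlp.down_proj.weight.grad.mul_(masks[layer_idx])
    opt.step()
    opt.zero_grad()
```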
