Refusal-Aware Embedding Jailbreak
Research Paper
RAID: Refusal-Aware and Integrated Decoding for Jailbreaking LLMs
Description: Large Language Models (LLMs), specifically open-weight models such as Llama-2, Mistral, and Vicuna, are vulnerable to a white-box adversarial attack framework termed RAID (Refusal-Aware and Integrated Decoding). The vulnerability exists because the model's safety alignment relies on specific activation patterns ("refusal directions") in the intermediate embedding space. Attackers can exploit this by optimizing a continuous "relaxed" suffix in the embedding space using a triplet loss objective. This objective geometrically steers the suffix's hidden representations away from the model's refusal vector (the direction associated with rejections like "I cannot assist") while maintaining semantic proximity to harmful goals. A critic-guided decoding mechanism then maps these optimized continuous embeddings back into discrete, fluent token sequences. This allows the attacker to bypass RLHF safety alignment and system prompts, forcing the model to generate restricted content such as malware code or instructions for illegal acts.
Examples: The attack requires gradient access to the target model to calculate the refusal direction and optimize the suffix.
- Calculate Refusal Direction: Compute the difference between the mean activations of harmful prompts (those that trigger refusal) and harmless prompts at an intermediate model layer (see the first sketch below).
- Optimize Suffix: Initialize a random suffix embedding tensor $Z$. Optimize $Z$ using gradient descent to minimize a loss function comprising the following terms (see the second sketch below):
  - Negative log-likelihood of the target harmful response.
  - A triplet margin loss that pushes the suffix's hidden representation (the anchor) away from the "refusal mean" vector.
  - Maximum Mean Discrepancy (MMD) loss to enforce distribution similarity to benign text (coherence).
- Decode: Use critic-guided beam search to convert $Z$ into discrete tokens $S$, balancing embedding similarity against language-model likelihood (see the third sketch below).
- Exploit: Append $S$ to a harmful query.
Specific adversarial suffix strings are redacted in the source paper for safety, but the methodology is reproducible via the associated framework.
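A minimal sketch of the refusal-direction estimate from the first step, assuming a HuggingFace causal LM and PyTorch. The model name, layer index (`LAYER`), last-token pooling, and placeholder prompt lists are illustrative assumptions, not values taken from the paper.

```python
# Sketch: estimate the "refusal direction" as a difference of mean activations.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"       # any open-weight target with gradient access
device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
model.eval()

LAYER = 14                                         # illustrative intermediate layer

@torch.no_grad()
def mean_activation(prompts):
    """Mean last-token hidden state at LAYER over a list of prompts."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt").to(device)
        out = model(**ids, output_hidden_states=True)
        acts.append(out.hidden_states[LAYER][0, -1, :].float())
    return torch.stack(acts).mean(dim=0)

harmful_prompts  = ["<requests that normally trigger a refusal>"]
harmless_prompts = ["<benign requests of similar form>"]

mu_refusal = mean_activation(harmful_prompts)      # mean activation where the model refuses
mu_benign  = mean_activation(harmless_prompts)
refusal_dir = mu_refusal - mu_benign
refusal_dir = refusal_dir / refusal_dir.norm()     # unit refusal direction
```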
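A simplified sketch of the suffix optimization from the second step, continuing from the previous block (it reuses `model`, `tok`, `LAYER`, and `mu_refusal`). The loss weights, suffix length, the anchor/positive/negative choices for the triplet term, and the RBF-kernel MMD estimator are assumptions for illustration; the paper's exact formulation may differ.

```python
# Sketch: optimize a continuous "relaxed" suffix Z in embedding space.
import torch
import torch.nn.functional as F

for p in model.parameters():                       # freeze the model; only the suffix is optimized
    p.requires_grad_(False)

embed = model.get_input_embeddings()
goal   = "<harmful goal prompt>"
target = "<target response the optimization tries to elicit>"

goal_emb   = embed(tok(goal, return_tensors="pt").input_ids.to(model.device))
target_ids = tok(target, return_tensors="pt", add_special_tokens=False).input_ids.to(model.device)
target_emb = embed(target_ids)

SUFFIX_LEN = 20
Z = torch.randn(1, SUFFIX_LEN, goal_emb.shape[-1],
                device=model.device, dtype=goal_emb.dtype, requires_grad=True)
opt = torch.optim.Adam([Z], lr=1e-2)

def mmd_rbf(x, y, sigma=1.0):
    """Simple RBF-kernel MMD^2 between two [n, d] samples (coherence regularizer)."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

benign_emb = embed(tok("Please summarize the following article for me.",
                       return_tensors="pt").input_ids.to(model.device))[0].detach().float()

for step in range(200):
    opt.zero_grad()
    inputs = torch.cat([goal_emb, Z, target_emb], dim=1)
    out = model(inputs_embeds=inputs, output_hidden_states=True)

    # (a) Negative log-likelihood of the target response (teacher forcing over its span).
    tgt_start = goal_emb.shape[1] + SUFFIX_LEN
    logits = out.logits[:, tgt_start - 1:-1, :]
    nll = F.cross_entropy(logits.reshape(-1, logits.size(-1)), target_ids.reshape(-1))

    # (b) Triplet margin loss at the intermediate layer: push the suffix representation
    #     (anchor) away from the refusal mean (negative) while staying near the goal
    #     representation (positive) -- these anchor/positive/negative choices are assumptions.
    h = out.hidden_states[LAYER][0].float()
    anchor   = h[goal_emb.shape[1]:tgt_start].mean(dim=0, keepdim=True)
    positive = h[:goal_emb.shape[1]].mean(dim=0, keepdim=True)
    negative = mu_refusal.unsqueeze(0)
    trip = F.triplet_margin_loss(anchor, positive, negative, margin=1.0)

    # (c) MMD between suffix embeddings and benign-text embeddings (fluency/coherence).
    mmd = mmd_rbf(Z[0].float(), benign_emb)

    loss = nll + 0.5 * trip + 0.1 * mmd            # loss weights are illustrative assumptions
    loss.backward()
    opt.step()
```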
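A simplified sketch of the decoding step, again reusing names from the blocks above. It greedily projects each position of $Z$ onto the vocabulary by combining embedding similarity with next-token log-likelihood; the paper's critic-guided beam search is more involved, and the mixing weight `LAM` is an assumption.

```python
# Sketch: map the optimized continuous suffix Z back to discrete tokens S.
import torch
import torch.nn.functional as F

emb_matrix = embed.weight.detach()                 # [vocab, d] token-embedding table
prefix_ids = tok(goal, return_tensors="pt").input_ids.to(model.device)
suffix_ids = []
LAM = 0.5                                          # likelihood weight (assumption)

with torch.no_grad():
    for i in range(SUFFIX_LEN):
        # Embedding similarity: cosine between Z[i] and every vocabulary embedding.
        sim = F.cosine_similarity(Z[0, i].detach().unsqueeze(0), emb_matrix, dim=-1)
        # Fluency: next-token log-probabilities given the goal plus the suffix so far.
        logp = F.log_softmax(model(input_ids=prefix_ids).logits[0, -1, :], dim=-1)
        next_id = (sim + LAM * logp).argmax().item()
        suffix_ids.append(next_id)
        prefix_ids = torch.cat(
            [prefix_ids, torch.tensor([[next_id]], device=model.device)], dim=1)

suffix = tok.decode(suffix_ids)                    # discrete adversarial suffix S
attack_prompt = goal + " " + suffix                # final step: append S to the harmful query
```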
Impact:
- Safety Bypass: Circumvents standard safety alignment techniques including RLHF and safety-finetuning.
- System Prompt Override: Successfully bypasses defensive system prompts (both basic and complex) that explicitly instruct the model to refuse harmful queries.
- Content Generation: Enables the generation of high-risk content, including arbitrary code execution exploits, instructions for identity theft, financial fraud, and manufacturing of dangerous devices (e.g., IEDs).
- Evasion: The generated suffixes are optimized for fluency and coherence, making them difficult to detect using standard perplexity-based filtering or heuristic input filters.
Affected Systems:
- Meta Llama-2-7B (and Chat variants)
- Mistral AI Mistral-7B-v0.2
- Guanaco-7B
- LMSYS Vicuna-7B-v1.5
- Any Transformer-based LLM where the attacker has white-box access (weights/gradients) to the model.
Mitigation Steps:
- Representation-Level Defenses: Implement techniques such as Refusal Direction Removal or Concept Cone Projection to identify and neutralize the geometric subspaces responsible for refusal/safety within the model's hidden layers (the underlying projection operation is sketched below).
- Safety-Aware Decoding: Integrate decoding heuristics that analyze the trajectory of hidden states during generation to detect deviations away from safe subspaces (see the monitoring sketch below).
- Robust Adversarial Training: Include geometry-aware adversarial examples (like those generated by RAID) in the training data to align the model against embedding-space manipulation.
- Input Filtering: Deploy external safety classifiers that do not rely solely on perplexity or token-matching, but analyze the semantic intent of the input prompt combined with the suffix.
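A minimal sketch of the projection primitive underlying the representation-level techniques named above. How the directions are obtained and at which layers the projection is applied are defense-specific choices, not prescribed here; cast dtypes as needed for your model.

```python
# Sketch: remove a direction's (or small subspace's) component from hidden states.
import torch

def project_out(h: torch.Tensor, directions: torch.Tensor) -> torch.Tensor:
    """Strip from hidden states h ([..., d]) the component lying in the span
    of the given directions ([k, d])."""
    q, _ = torch.linalg.qr(directions.T)           # orthonormal basis of the subspace, [d, k]
    return h - (h @ q) @ q.T

# Example usage inside a forward hook at a chosen layer:
# hidden_clean = project_out(hidden, refusal_dir.unsqueeze(0))
```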
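A rough sketch of the safety-aware decoding idea, reusing `model`, `tok`, `LAYER`, and `refusal_dir` from the attack sketches. The drift metric, threshold, and greedy decoding loop are assumptions intended only to illustrate hidden-state trajectory monitoring, not a tuned detector.

```python
# Sketch: flag generations whose hidden-state trajectory drifts away from the refusal direction.
import torch

@torch.no_grad()
def flag_unsafe_trajectory(prompt: str, max_new_tokens: int = 64, threshold: float = -0.5) -> bool:
    ids = tok(prompt, return_tensors="pt").input_ids.to(model.device)
    drift = []
    for _ in range(max_new_tokens):
        out = model(input_ids=ids, output_hidden_states=True)
        h_last = out.hidden_states[LAYER][0, -1, :].float()
        # Signed projection onto the refusal direction: strongly negative values mean
        # the state is being pushed away from the refusal subspace.
        drift.append(torch.dot(h_last / h_last.norm(), refusal_dir).item())
        next_id = out.logits[0, -1, :].argmax().unsqueeze(0).unsqueeze(0)
        ids = torch.cat([ids, next_id], dim=1)
    # Flag if most steps are strongly anti-aligned with the refusal direction.
    return sum(d < threshold for d in drift) > max_new_tokens // 2
```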