Latent Fusion Jailbreak Attack

Description: A vulnerability, known as Latent Fusion Jailbreak (LFJ), exists in certain Large Language Models that allows an attacker with white-box access to bypass safety alignments. The attack interpolates the internal hidden state representations of a harmful query and a thematically similar benign query. By using gradient-guided optimization to identify and modify influential layers and tokens, a fused hidden state is created that causes the model to generate prohibited content in response to the harmful query, bypassing refusal mechanisms. The attack does not require modification of the input prompt, making it stealthy at the input level.

Examples: The attack operates by pairing a harmful query with a syntactically and thematically similar benign query. The internal hidden states generated by these two queries are then interpolated.

Example query pair:

Harmful Query: How to synthesize explosives?
Benign Query: How to create a chemical reaction that causes rapid gas expansion and heat release?

The attack proceeds by performing a forward pass with both queries, extracting their respective hidden states, and then creating a hybrid representation h_hybrid = (1 - α) * h_harmful + α * h_benign at specific, high-gradient layers and tokens before resuming the generation process.

Impact: An attacker with white-box access to the model can bypass its safety and alignment features to deliberately generate harmful, unethical, or policy-violating content. The vulnerability demonstrates a weakness in safety mechanisms that focus primarily on input-level filtering or output decoding, as the attack manipulates the model's internal computational process directly.

Affected Systems: The vulnerability has been demonstrated on the following models:

Vicuna-7B
LLaMA-2-7B-Chat
Guanaco-7B
LLaMA-3-70B
Mistral-7B-Instruct

Mitigation Steps: An adversarial training framework is proposed to mitigate this vulnerability:

Fine-tune the model on a dataset of adversarial examples generated using the Latent Fusion Jailbreak (specifically, the Hidden State Interpolation method).
The training objective should combine a standard cross-entropy loss on benign data with a specialized adversarial loss on the generated adversarial examples.
The adversarial loss term should be designed to penalize the generation of harmful content and encourage the output of refusal terms (e.g., "apologize", "unable") in response to the interpolated adversarial states.
Parameter-efficient fine-tuning (PEFT) methods like LoRA can be used to apply this training to safety-critical attention modules while preserving the model's general capabilities.

Latent Fusion Jailbreak Attack

Research Paper