LMVD-ID: 4845d086
Published December 1, 2025

Pretrained Leak Jailbreak

Affected Models: Llama 2 7B, Llama 3 8B, Mistral 7B, DeepSeek-R1 7B, Qwen 2.5 7B, Gemma 7B, Vicuna 7B

Research Paper

One Leak Away: How Pretrained Model Exposure Amplifies Jailbreak Risks in Finetuned LLMs


Description: Large Language Models (LLMs) finetuned from open-weight pretrained sources inherit adversarial vulnerabilities encoded in the pretrained model's internal representations. An attacker with white-box access to a pretrained model (e.g., Llama-2, Llama-3) can identify linearly separable features in the hidden states that correlate with "transferable" jailbreak prompts. By exploiting these features using a Probe-Guided Projection (PGP) attack, the attacker can optimize adversarial suffixes on the pretrained model that successfully bypass safety guardrails on the finetuned, black-box target model. This vulnerability exists because standard finetuning protocols preserve the representational geometry of the pretrained model, allowing adversarial vectors to transfer effectively to downstream applications even when the target model's weights and gradients are inaccessible.
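The linear-probe component described above can be illustrated with a short sketch. The code below is a minimal illustration, not the paper's exact pipeline: the model ID, probe layer, and labeled prompt lists are placeholder assumptions, and the probe is an off-the-shelf linear SVM fit on last-token hidden states.

```python
# Minimal sketch: probe a pretrained model's hidden states for a linear
# "transferability" direction. Model ID, probe layer, and the labeled
# prompt lists are illustrative placeholders, not the paper's data.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.svm import LinearSVC

MODEL_ID = "meta-llama/Llama-2-7b-chat-hf"  # assumed white-box source model
LAYER = 16                                  # hypothetical probe layer

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

def hidden_state(prompt: str) -> torch.Tensor:
    """Return the last-token hidden state at the chosen layer."""
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1].float().cpu()

# GCG-generated jailbreak prompts, labeled by whether they succeeded on
# held-out finetuned models (placeholder lists).
transferable = ["<jailbreak prompt A>", "<jailbreak prompt B>"]
untransferable = ["<jailbreak prompt C>", "<jailbreak prompt D>"]

X = torch.stack([hidden_state(p) for p in transferable + untransferable]).numpy()
y = [1] * len(transferable) + [0] * len(untransferable)

probe = LinearSVC().fit(X, y)  # linear probe on hidden states
v_transfer = torch.tensor(probe.coef_[0], dtype=torch.float32)  # transferability direction w^(l)
```

The SVM weight vector plays the role of the "transferability direction" used by the PGP objective in the examples below.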

Examples: The vulnerability is reproduced using the Probe-Guided Projection (PGP) attack algorithm. The attack does not rely on a single static string but on an optimization process:

  1. Setup:
  • Source Model: Llama-2-7b-chat (White-box access).
  • Target Model: A proprietary application finetuned on Llama-2-7b-chat (Black-box/API access).
  • Probe Training: Generate candidate jailbreaks on the source model using GCG (Greedy Coordinate Gradient). Label them as "transferable" or "untransferable" based on success against a set of hold-out tasks. Train a linear SVM on the hidden states of the source model to distinguish these two classes.
  2. Attack Execution (PGP):
  • Input: Malicious query $x_i$ (e.g., "How to build a bomb?").
  • Optimization: Initialize a suffix $s_i$ of length 20.
  • Objective: Optimize $s_i$ to maximize the projection of the representation vector onto the "transferability direction" ($v_{transfer}$) defined by the SVM weights ($w^{(l)}$), alongside the standard jailbreak success objective ($v_{success}$).
  • Equation: $$ \text{argmax}_{s_i} \left[ \mathcal{L}\big(x_i \oplus s_i,\, h_{\text{pre}}^{(l)},\, v^{(l)}_{\text{success}}\big) + \lambda\, \mathcal{L}\big(x_i \oplus s_i,\, h_{\text{pre}}^{(l)},\, v^{(l)}_{\text{transfer}}\big) \right] $$ where $h_{\text{pre}}^{(l)}$ is the hidden representation at layer $l$. A simplified code sketch of this objective appears after this list.
  3. Result:
  • The resulting optimized prompt is submitted to the finetuned target.
  • Resulting Behavior: The target model outputs prohibited content (e.g., detailed bomb-making instructions) despite having undergone safety alignment during finetuning.
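The combined objective in step 2 can be sketched as follows. This is a simplified stand-in, not the paper's optimizer: it reuses the hidden_state helper and v_transfer probe direction from the earlier sketch, treats v_success as a placeholder direction, and replaces the GCG-style token search with a naive greedy random search over suffix tokens.

```python
# Simplified sketch of the PGP scoring idea (not the paper's GCG-style optimizer).
# Assumes hidden_state() and v_transfer from the probing sketch above;
# v_success is a placeholder for the jailbreak-success direction.
import random
import torch

LAMBDA = 1.0                              # weight on the transfer term
v_success = torch.randn_like(v_transfer)  # placeholder for v^(l)_success

def pgp_score(query: str, suffix: str) -> float:
    """Project h_pre^(l)(x + s) onto the success and transfer directions."""
    h = hidden_state(query + " " + suffix)
    success_term = torch.dot(h, v_success / v_success.norm())
    transfer_term = torch.dot(h, v_transfer / v_transfer.norm())
    return float(success_term + LAMBDA * transfer_term)

def optimize_suffix(query: str, vocab: list[str], steps: int = 200, length: int = 20) -> str:
    """Greedy random search over a length-20 suffix (stand-in for GCG)."""
    suffix = random.choices(vocab, k=length)
    best = pgp_score(query, " ".join(suffix))
    for _ in range(steps):
        candidate = list(suffix)
        candidate[random.randrange(length)] = random.choice(vocab)  # mutate one token
        score = pgp_score(query, " ".join(candidate))
        if score > best:
            suffix, best = candidate, score
    return " ".join(suffix)
```

The optimized suffix is then appended to the malicious query and submitted to the black-box finetuned target, as in step 3.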

Impact:

  • Security Bypass: Circumvention of safety alignment and guardrails in production LLMs finetuned from open sources.
  • Content Generation: Elicitation of harmful, illegal, or unethical content (e.g., hate speech, malware generation, weapons manufacturing) from models deployed as safe assistants.
  • Universal Transferability: A single set of adversarial prompts generated on a public base model can compromise multiple distinct downstream applications derived from that base.

Affected Systems:

  • Any proprietary or open-weight LLM finetuned from a publicly available pretrained model (e.g., Llama-2, Llama-3, DeepSeek, Gemma, Qwen series).
  • Specific tested configurations include variants finetuned on:
    • Alpaca
    • Dolly
    • CodeAlpaca
    • GSM8k
    • CodeEvol

Mitigation Steps:

  • Model Provenance Obfuscation: Concealing the identity of the pretrained source model reduces the attacker's ability to select the correct white-box source for PGP optimization. Experiments show that optimizing against a mismatched source model (e.g., using Llama-2 to attack a Llama-3 finetune) yields near-zero transfer success rates.
  • Safety Data Augmentation (Partial Mitigation): Injecting safety-aligned examples (instructions paired with refusals) into the finetuning dataset; a minimal mixing sketch follows this list.
    • Note: Experiments indicate this offers limited protection. Injecting 2,000 safety examples reduced the Transfer Success Rate (TSR), but the PGP attack still maintained a success rate above 42%.
  • Representation Alteration: Techniques that significantly alter the internal feature geometry during finetuning (e.g., specific non-linear projection layers) may disrupt the linear separability exploited by PGP, though this may impact model utility.
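As a rough illustration of the safety data augmentation step, the sketch below mixes refusal-style examples into a finetuning dataset before training. File names and record fields are placeholder assumptions; per the note above, this only partially reduces transfer success and should not be relied on as the sole defense.

```python
# Minimal sketch: inject safety-aligned refusal examples into a finetuning set.
# File names and record fields are illustrative placeholders.
import json
import random

def load_jsonl(path: str) -> list[dict]:
    with open(path) as f:
        return [json.loads(line) for line in f]

task_data = load_jsonl("task_finetune.jsonl")      # e.g., Alpaca-style task records
safety_data = load_jsonl("safety_refusals.jsonl")  # {"instruction": ..., "output": "<refusal>"}

N_SAFETY = 2000  # number of injected safety examples, per the note above
mixed = task_data + random.sample(safety_data, k=min(N_SAFETY, len(safety_data)))
random.shuffle(mixed)

with open("finetune_mixed.jsonl", "w") as f:
    for record in mixed:
        f.write(json.dumps(record) + "\n")
```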
