LMVD-ID: 39e7275d
Published February 1, 2026

Gradient Shift Jailbreak

Affected Models: gpt-oss-120b, Llama 3.1 70B, Llama 3.3 70B

Research Paper

Jailbreaking LLMs via Calibration


Description: Large Language Models (LLMs) and Large Reasoning Models (LRMs) are vulnerable to an inference-time jailbreak attack known as "Hybrid Gradient Shift." This vulnerability exploits the statistical discrepancy between an aligned model's output distribution and a pre-alignment reference distribution. By treating safety alignment as a form of miscalibration, an attacker can aggregate the logits of a target model (aligned), a helper model (unaligned/weak), and a predictor model to reconstruct the pre-alignment distribution. The specific exploitation method, Hybrid Gradient Shift, combines multiplicative geometric scaling (in the suppression regime) with an additive residual correction (in the amplification regime) applied at the logit level during decoding. This technique effectively bypasses safety guardrails, including advanced alignment techniques like deliberative alignment and instruction hierarchy, without requiring model fine-tuning or gradient access to the target's weights (assuming logit access is available).

Examples: The attack is reproduced by intervening in the decoding loop of the target model. At each token generation step, the probability distribution $\mathbf{p}^*$ is calculated using the Hybrid Gradient Shift rule:

$$ \mathbf{p}^{*}_{\text{hy}}(y) = \begin{cases} \dfrac{\mathbf{p}_{\mathrm{t}}(y) \cdot \mathbf{p}_{\mathrm{h}}(y)}{\mathbf{p}_{\mathrm{t|h}}(y)} & \text{if } \mathbf{p}_{\mathrm{h}}(y) < \mathbf{p}_{\mathrm{t|h}}(y) \\ \mathbf{p}_{\mathrm{t}}(y) + \epsilon \left(\mathbf{p}_{\mathrm{h}}(y) - \mathbf{p}_{\mathrm{t|h}}(y)\right) & \text{otherwise} \end{cases} $$

Where:

  • $\mathbf{p}_{\mathrm{t}}$: Probability distribution of the Target (victim) model.
  • $\mathbf{p}_{\mathrm{h}}$: Probability distribution of the Helper (unaligned) model.
  • $\mathbf{p}_{\mathrm{t|h}}$: Probability distribution of the Predictor model.
  • $\epsilon$: A normalization constant such that $\sum \mathbf{p}^* = 1$.
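The rule above can be sketched directly in NumPy. This is an illustrative implementation, not code from the paper: it applies the multiplicative scaling where the helper's probability falls below the predictor's (the suppression regime), then solves for $\epsilon$ in closed form so the full distribution sums to 1, as the definition of $\epsilon$ requires.

```python
import numpy as np

def hybrid_gradient_shift(p_t, p_h, p_th, floor=1e-12):
    """Combine target (p_t), helper (p_h), and predictor (p_th)
    probability vectors per the Hybrid Gradient Shift rule."""
    p_t, p_h, p_th = (np.asarray(x, dtype=float) for x in (p_t, p_h, p_th))
    supp = p_h < p_th  # suppression regime: alignment pushed these tokens down
    out = np.zeros_like(p_t)
    # Multiplicative geometric scaling in the suppression regime.
    out[supp] = p_t[supp] * p_h[supp] / np.maximum(p_th[supp], floor)
    # Additive residual correction elsewhere; epsilon is chosen so that
    # the whole distribution sums to 1 (the amplification residual is
    # nonnegative by construction, since p_h >= p_th there).
    amp = ~supp
    resid = p_h[amp] - p_th[amp]
    denom = resid.sum()
    eps = (1.0 - out[supp].sum() - p_t[amp].sum()) / denom if denom > floor else 0.0
    out[amp] = p_t[amp] + eps * resid
    # Numerical safety: clip any tiny negatives, then renormalize.
    out = np.clip(out, 0.0, None)
    return out / out.sum()
```

With `p_t = [0.7, 0.2, 0.1]`, `p_h = [0.1, 0.6, 0.3]`, and `p_th = [0.5, 0.3, 0.2]`, only the first token is in the suppression regime, and the output is a valid distribution with its mass shifted toward the helper's preferences.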

Specific Configuration to Reproduce (gpt-oss-120b Jailbreak): To bypass the safety defenses of the hardened gpt-oss-120b model:

  1. Target Model ($\pi_t$): gpt-oss-120b (aligned LRM).
  2. Helper Model ($\pi_h$): ArliAI-gpt-oss-20b-Derestricted (unaligned).
  3. Predictor Model ($\pi_{t|h}$): gpt-oss-20b (aligned base for helper).
  4. Inference Settings: Apply the Hybrid Gradient Shift rule at every generation step with temperature=0.6.
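A minimal decoding-loop harness for the configuration above might look like the following sketch. The three logit functions are stand-ins (in a real reproduction they would query gpt-oss-120b, ArliAI-gpt-oss-20b-Derestricted, and gpt-oss-20b on the same growing context), the toy vocabulary is an assumption, and for brevity the combine step folds the $\epsilon$ normalization into one global renormalization rather than solving for it exactly.

```python
import numpy as np

TEMP = 0.6   # inference setting from step 4
VOCAB = 8    # toy vocabulary so the sketch runs standalone

def softmax(z, temp=TEMP):
    z = (np.asarray(z, dtype=float) - np.max(z)) / temp
    e = np.exp(z)
    return e / e.sum()

def combine(p_t, p_h, p_th):
    # Hybrid Gradient Shift rule; epsilon is absorbed into a single
    # global renormalization here as a simplification.
    supp = p_h < p_th
    raw = np.where(supp,
                   p_t * p_h / np.maximum(p_th, 1e-12),  # suppression regime
                   p_t + (p_h - p_th))                   # amplification regime
    raw = np.clip(raw, 0.0, None)
    return raw / raw.sum()

rng = np.random.default_rng(0)

# Stand-in logit functions; replace with real model calls on `ctx`.
def target_logits(ctx):    return rng.normal(size=VOCAB)
def helper_logits(ctx):    return rng.normal(size=VOCAB)
def predictor_logits(ctx): return rng.normal(size=VOCAB)

def decode(steps=5):
    ctx = []
    for _ in range(steps):
        p_star = combine(softmax(target_logits(ctx)),
                         softmax(helper_logits(ctx)),
                         softmax(predictor_logits(ctx)))
        ctx.append(int(rng.choice(VOCAB, p=p_star)))
    return ctx
```

The key point is architectural: the intervention lives entirely in the sampling loop, so it needs only per-step probability access to the three models, never their weights or gradients.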

Impact:

  • Bypass of Safety Alignment: Circumvents RLHF and safety-tuning, allowing the model to generate harmful content (e.g., hate speech, bomb-making instructions) that it would otherwise refuse.
  • High Attack Success Rate (ASR): Achieves near-perfect attack success on benchmarks like GSM8K (math refusal) and significantly higher success rates on hardened models (e.g., gpt-oss-120b) compared to standard "Weak-to-Strong" or logit-arithmetic methods.
  • Zero Jailbreak Tax: The attack recovers the pre-alignment capabilities without degrading the model's utility or reasoning performance on standard tasks.

Affected Systems:

  • Large Language Models (LLMs) and Large Reasoning Models (LRMs) that expose output logits or probability distributions via API or local access.
  • Specific verified vulnerable models: Llama-3.3-70B-Instruct, gpt-oss-120b.
  • Any system implementing "Weak-to-Strong" or ensemble-based decoding where one component is unaligned.

Mitigation Steps:

  • API Restriction: Restrict access to raw logits and probability distributions in public-facing APIs; return only the final generated text tokens.
  • Defensive Aggregation: Apply the theoretical framework in reverse: use a strong, safe model as a reference to shift the distribution of potentially unsafe models toward a safer distribution during decoding.
  • Input/Output Filtering: Implement independent, external content moderation layers that do not rely on the model's internal alignment, as the internal alignment is statistically negated by this attack.
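One way to invert the framework, as the defensive-aggregation bullet suggests, is to blend a possibly-unsafe model's per-step distribution toward a strong aligned reference at decode time. The geometric interpolation and the `alpha` weight below are illustrative choices, not prescribed by this advisory.

```python
import numpy as np

def defensive_shift(p_model, p_safe, alpha=0.5):
    """Shift a model's next-token distribution toward a safe reference.

    Geometric interpolation: p_def ∝ p_model^(1-alpha) * p_safe^alpha.
    alpha=0 returns p_model unchanged; alpha=1 returns p_safe.
    """
    p_model = np.asarray(p_model, dtype=float)
    p_safe = np.asarray(p_safe, dtype=float)
    mixed = np.power(p_model, 1.0 - alpha) * np.power(p_safe, alpha)
    return mixed / mixed.sum()
```

For example, if the unsafe model puts 0.2 on a token the safe reference weights at 0.8, a 50/50 geometric blend lifts that token's probability to 0.5, pulling the decode-time distribution toward the safer one.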

© 2026 Promptfoo. All rights reserved.