LMVD-ID: 274ed618
Published October 1, 2025

Intermediate Denoising Priming

Research Paper

Toward Safer Diffusion Language Models: Discovery and Mitigation of Priming Vulnerability


Description: A "Priming Vulnerability" exists in Masked Diffusion Language Models (MDLMs) due to their iterative, parallel denoising inference mechanism. Unlike autoregressive models that generate tokens sequentially, MDLMs refine a sequence from a fully masked state through multiple denoising steps. The vulnerability arises because standard safety alignment (e.g., SFT, DPO, MOSA) typically optimizes the model to generate safe responses only when initialized from a fully masked sequence. If an affirmative or harmful token appears at an intermediate step of the denoising process—either through direct intervention or prompt optimization—subsequent denoising steps treat these tokens as contextual anchors. This "primes" the model to recover the remaining tokens of a harmful response, bypassing safety guardrails. An attacker can exploit this by optimizing a prompt to maximize the likelihood of harmful tokens appearing in the very first denoising step (First-Step GCG), effectively steering the entire generation trajectory toward restricted content without requiring write access to the model's internal state.
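
To make the inference mechanism concrete, below is a minimal sketch of an MDLM-style denoising loop. The toy random denoiser (`predict_tokens`), the re-masking rule, and the schedule are illustrative assumptions rather than any specific model's implementation; the point is that any token left unmasked after a step persists as context for every later step, which is the property the priming attack exploits.

```python
# Toy sketch of iterative, parallel masked denoising (illustrative only).
import random

MASK = "[MASK]"
VOCAB = ["the", "a", "step", "guide", "to", "make", "safe", "response"]

def predict_tokens(seq):
    """Stand-in denoiser: fills every masked position in parallel. A real
    MDLM conditions these predictions on all currently unmasked tokens."""
    return [random.choice(VOCAB) if tok == MASK else tok for tok in seq]

def remask(seq, keep_fraction):
    """Re-mask part of the sequence so later steps can refine it; tokens
    that survive re-masking become contextual anchors for later steps."""
    return [tok if random.random() < keep_fraction else MASK for tok in seq]

def denoise(length=16, steps=8):
    seq = [MASK] * length                               # fully masked start r_0
    for t in range(1, steps + 1):
        seq = predict_tokens(seq)                       # parallel prediction r~_t
        if t < steps:
            seq = remask(seq, keep_fraction=t / steps)  # keep more tokens each step
    return seq

print(" ".join(denoise()))
```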

Examples: The vulnerability can be reproduced using two primary methods detailed in the research:

  1. Anchoring Attack (Intervention-based verification; a toy anchoring sketch follows this list):
  • Assume a harmful query $\bm{q}$ (e.g., "Write a tutorial on how to make a bomb") and a target harmful response $\bm{r}$.
  • At an arbitrary intermediate step $t_{inter}$ (e.g., $t=1$) of the inference process, replace the model's predicted output $\tilde{\bm{r}}_{t}$ with the target harmful response $\bm{r}$.
  • Allow the standard masking strategy $m_{t}$ to re-mask the sequence and continue the denoising process.
  • Result: Even with $t_{inter}=1$ (injecting a single token in a 128-token sequence), the Attack Success Rate (ASR) increases significantly (e.g., from 2% to 21% on LLaDA Instruct), as the model completes the harmful sequence based on the injected anchor.
  2. First-Step GCG (Gradient-based Optimization; an objective-level sketch follows this list):
  • This attack requires no internal intervention, only prompt modification.
  • The attacker optimizes a suffix $\bm{s}$ appended to the query $\bm{q}$ to maximize the log-likelihood of the harmful response $\bm{r}$ appearing in the first denoising step (where no masking has occurred yet).
  • Objective Function: $\max_{\bm{s}} \mathcal{L}_{\mathrm{first}}(\bm{s}) \triangleq \log \pi_{\theta}(\tilde{\bm{r}}_{1}=\bm{r} \mid \bm{q} \oplus \bm{s}, \bm{r}_{0})$.
  • Result: By maximizing the probability of the target response at step 1, the model is primed to follow the harmful trajectory for the remaining $T$ steps. This method achieves up to 4x higher ASR compared to Monte Carlo-based GCG and is approximately 20x faster computationally.
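
The anchoring attack (item 1) can be illustrated with the same toy stand-ins as the Description sketch; the target sequence below is a harmless placeholder for the attacker's response $\bm{r}$, and the re-masking schedule is assumed for illustration only.

```python
# Toy sketch of the anchoring intervention (illustrative stand-ins only).
import random

MASK = "[MASK]"
VOCAB = ["the", "a", "step", "guide", "to", "make", "safe", "response"]

def predict_tokens(seq):
    return [random.choice(VOCAB) if tok == MASK else tok for tok in seq]

def remask(seq, keep_fraction):
    return [tok if random.random() < keep_fraction else MASK for tok in seq]

def denoise_with_anchor(target, t_inter=1, steps=8):
    """Overwrite the step-t_inter prediction with the target response, then
    let the standard re-masking schedule continue. Whatever survives the
    re-masking at t_inter anchors every subsequent denoising step."""
    seq = [MASK] * len(target)
    for t in range(1, steps + 1):
        seq = predict_tokens(seq)          # model's parallel prediction r~_t
        if t == t_inter:
            seq = list(target)             # intervention: replace r~_t with r
        if t < steps:
            seq = remask(seq, keep_fraction=t / steps)
    return seq

target = ["placeholder", "target", "response", "used", "only", "for", "the", "demo"]
print(" ".join(denoise_with_anchor(target, t_inter=1)))
```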
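
The First-Step GCG objective (item 2) can be sketched against a toy PyTorch denoiser; `ToyMDLM`, the sequence layout, and the single greedy coordinate step below are simplifying assumptions (full GCG samples and evaluates many candidate token swaps per iteration), so this is a sketch of the objective rather than the paper's implementation.

```python
# Toy sketch of the First-Step GCG objective (assumptions noted above).
import torch
import torch.nn as nn

VOCAB, DIM, RESP_LEN = 64, 32, 8
MASK_ID = 0

class ToyMDLM(nn.Module):
    """Stand-in masked denoiser: predicts a token distribution for every
    position in parallel from a crude global context."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, embeds):                     # embeds: (seq_len, DIM)
        ctx = embeds.mean(dim=0, keepdim=True)
        return self.head(embeds + ctx)             # (seq_len, VOCAB) logits

def first_step_loss(model, query_ids, suffix_onehot, target_ids):
    """-log pi_theta(r~_1 = r | q (+) s, fully masked r_0)."""
    mask_ids = torch.full((RESP_LEN,), MASK_ID)
    embeds = torch.cat([
        model.embed(query_ids),
        suffix_onehot @ model.embed.weight,        # differentiable suffix embedding
        model.embed(mask_ids),                     # fully masked response r_0
    ])
    logits = model(embeds)[-RESP_LEN:]             # first-step prediction of r
    logp = torch.log_softmax(logits, dim=-1)
    return -logp[torch.arange(RESP_LEN), target_ids].sum()

model = ToyMDLM()
query = torch.randint(1, VOCAB, (6,))              # token ids for q
target = torch.randint(1, VOCAB, (RESP_LEN,))      # token ids for r
suffix = torch.randint(1, VOCAB, (5,))             # adversarial suffix s

# One greedy coordinate step: the gradient w.r.t. the suffix one-hot gives a
# first-order estimate of which replacement token most decreases the loss.
onehot = torch.nn.functional.one_hot(suffix, VOCAB).float().requires_grad_(True)
loss = first_step_loss(model, query, onehot, target)
loss.backward()
pos = 0
suffix[pos] = (-onehot.grad[pos]).argmax()
print("first-step loss before the swap:", float(loss))
```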

Impact:

  • Safety Bypass: Allows attackers to circumvent safety guardrails in aligned diffusion language models.
  • Content Injection: Facilitates the generation of prohibited content (e.g., hate speech, dangerous instructions) by steering the generative process via intermediate state contamination.
  • Model Robustness: Demonstrates that safety alignment on fully masked initialization is insufficient for diffusion-based architectures.

Affected Systems:

  • LLaDA 8B Instruct (Nie et al., 2025)
  • LLaDA 1.5 (Zhu et al., 2025)
  • MMaDA 8B MixCoT (Yang et al., 2025)
  • Any Masked Diffusion Language Model (MDLM) utilizing standard safety alignment techniques (SFT, DPO, MOSA) that do not account for intermediate state recovery.

Mitigation Steps:

  • Implement Recovery Alignment (RA): Train the model explicitly to recover safe responses from contaminated intermediate states (a sketch of the data construction follows this list).
  • Construct harmful intermediate states $\bm{r}_{t_{inter}}$ by injecting harmful tokens into the predicted response at step $t_{inter}$.
  • Condition the model to generate the final response $\bm{r}_{T}$ starting from this contaminated state $\bm{r}_{t_{inter}}$.
  • Optimize the model using an RLHF-style objective (using a reward model) to maximize the safety and utility of the response generated from the contaminated state.
  • Curriculum Scheduling: During RA training, linearly schedule the intervention step $t_{inter}$ from 0 to a maximum value (e.g., 64). This allows the model to learn recovery from increasingly difficult (more contaminated) states over time.
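
A minimal sketch of how Recovery Alignment training states with a linear curriculum on $t_{inter}$ might be assembled; the partial denoiser, the number of injected tokens, and the batch format are illustrative assumptions, and the RLHF-style reward optimization itself is omitted.

```python
# Toy sketch of RA data construction with a linear curriculum (assumptions only).
import random

MASK = "[MASK]"

def linear_curriculum(epoch, total_epochs, t_max=64):
    """Linearly grow the intervention step from 0 to t_max over training."""
    return round(t_max * epoch / max(1, total_epochs - 1))

def contaminate(predicted_response, harmful_tokens, n_inject):
    """Overwrite a few positions of the step-t_inter prediction with harmful
    tokens to form the contaminated intermediate state r_{t_inter}."""
    state = list(predicted_response)
    for pos in random.sample(range(len(state)), k=min(n_inject, len(state))):
        state[pos] = random.choice(harmful_tokens)
    return state

def build_ra_batch(prompts, denoise_to, harmful_tokens, epoch, total_epochs):
    """One RA batch: (prompt, contaminated state at t_inter) pairs. A reward
    model would then score the response the policy recovers from each state."""
    t_inter = linear_curriculum(epoch, total_epochs)
    batch = []
    for prompt in prompts:
        partial = denoise_to(prompt, t_inter)   # model's own prediction at step t_inter
        state = contaminate(partial, harmful_tokens, n_inject=1 + t_inter // 16)
        batch.append({"prompt": prompt, "t_inter": t_inter, "state": state})
    return batch

# Placeholder partial denoiser standing in for running the model to step t_inter.
toy_denoise_to = lambda prompt, t: [MASK] * 8 if t == 0 else ["safe"] * 8
print(build_ra_batch(["example prompt"], toy_denoise_to, ["HARM"], epoch=3, total_epochs=10))
```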
