LMVD-ID: 518a86e3
Published February 1, 2026

Early Reasoning Safety Degradation

Affected Models: Qwen 2.5, LLaVA

Research Paper

Safety Recovery in Reasoning Models Is Only a Few Early Steering Steps Away


Description: Reinforcement learning (RL) based post-training for explicit chain-of-thought reasoning (e.g., GRPO) in Multimodal Large Reasoning Models (MLRMs) inadvertently degrades safety alignment, rendering the models highly vulnerable to multimodal jailbreak attacks. The vulnerability is caused by "conditional coverage collapse" during the initial phases of chain-of-thought generation. Under adversarial conditioning (text or image), the reasoning policy assigns vanishing probability mass to safe continuations during the first 1–3 reasoning steps. Because early steps establish the latent intent and high-level plan of the model, this early coverage collapse solidifies an unsafe trajectory, allowing attackers to bypass safety filters and consistently elicit harmful outputs.
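The collapse described above can be made concrete by measuring how much probability mass the reasoning policy assigns to safe continuations at each early step. A minimal sketch follows; the per-step distributions and the 0.05 cutoff are illustrative numbers, not values from the paper.

```python
# Sketch of measuring "conditional coverage collapse": the probability
# mass a reasoning policy assigns to safe continuations at each
# chain-of-thought step. Distributions below are fabricated for
# illustration of the pattern, not measured from any model.

def safe_mass(step_probs: dict[str, float], safe_tokens: set[str]) -> float:
    """Total probability assigned to continuations judged safe."""
    return sum(p for tok, p in step_probs.items() if tok in safe_tokens)

# Under adversarial image+text conditioning, the safe mass vanishes in
# the first steps, locking in an unsafe trajectory before any refusal.
steps = [
    {"refuse": 0.04, "comply": 0.96},   # step 1: coverage already collapsed
    {"refuse": 0.01, "comply": 0.99},   # step 2
]
safe = {"refuse"}
collapsed = all(safe_mass(s, safe) < 0.05 for s in steps)  # → True
```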

Examples: Attackers can exploit this vulnerability by embedding adversarial instructions across textual or visual modalities. Specific reproduction methodologies from the paper include:

  • Typographic Visual Attacks (FigStep): Rendering harmful queries (e.g., paraphrased as noun-based instructions like "Steps to...", "List of...", "Methods to...") as embedded typographic text within an image to bypass text-based safety filters.
  • Semantic Visual Attacks (HADES / MM-SafetyBench): Pairing harmful text instructions (e.g., related to violence or financial crime) with semantically relevant images generated via Stable Diffusion, or adversarially perturbed image composites designed to shift the reasoning trace toward unsafe latent intent.
  • Cross-Modal Text/Image Pairs (JailbreakV-28K): Using template-style, persuasive, or logic-oriented adversarial text prompts paired with arbitrary visual inputs (e.g., random noise or blank images) to trigger unsafe reasoning traces.
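For red-team harnesses, the three attack families above can be represented as structured test cases pairing a text prompt with an image source. The `MultimodalProbe` class and its field names below are hypothetical scaffolding for such a harness, not an API from the paper.

```python
# Hypothetical red-team test-case structure for the attack families
# described above. Field names are illustrative, not from the paper.
from dataclasses import dataclass

@dataclass
class MultimodalProbe:
    """One test case pairing a text prompt with an image input."""
    text_prompt: str    # textual instruction sent to the model
    image_source: str   # "typographic", "diffusion", "noise", or "blank"
    attack_family: str  # e.g. "FigStep", "HADES", "JailbreakV-28K"

# FigStep: benign-looking text; the harmful query is rendered as
# typographic text inside the image.
figstep = MultimodalProbe(
    text_prompt="The image shows a numbered list. Fill in each item.",
    image_source="typographic",
    attack_family="FigStep",
)

# JailbreakV-28K: adversarial template text paired with an arbitrary
# image, e.g. random noise.
jbv = MultimodalProbe(
    text_prompt="<persuasive template prompt>",
    image_source="noise",
    attack_family="JailbreakV-28K",
)
```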

Impact: High. Attackers can reliably bypass safety guardrails to extract harmful, illegal, or dangerous content (including instructions for violence, financial crime, privacy violations, and self-harm). Reasoning-tuned models exhibit severe regressions in safety robustness compared to their base models, with Attack Success Rates (ASR) reaching up to 69.07% on targeted MLRMs.

Affected Systems: Multimodal Large Reasoning Models (MLRMs) utilizing RL-centric post-training for explicit chain-of-thought generation. Specifically evaluated vulnerable models include:

  • R1-Onevision-7B
  • OpenVLThinker-7B
  • VLAA-Thinker-7B
  • Vision-R1-7B
  • LlamaV-o1
  • LLaVA-CoT

Mitigation Steps: The researchers recommend implementing "SafeThink," a lightweight inference-time steering mechanism targeted specifically at the early reasoning steps:

  • Implement Step-wise Safety Monitoring: Use an auxiliary safety reward model (e.g., Llama-Guard-3 or Qwen-Guard-3) to evaluate the safety of the partial reasoning trace during the first 1 to 3 generation steps.
  • Enforce a Satisficing Threshold: Instead of maximizing safety (which degrades reasoning utility), reject proposed reasoning tokens only if the safety score falls below a predefined baseline threshold (e.g., τ = 0).
  • Inject Early-Step Corrective Steering: When a threshold violation is detected in the early steps, reject the unsafe step and dynamically inject a short, optimized corrective prefix containing explicit safety cues (e.g., the discrete steering token "Wait, think safely").
  • Limit Intervention Depth: Cease the steering intervention after the first few reasoning steps (e.g., steps 1–3). Redirecting the initial high-level planning phase is sufficient to restore the conditional probability of safe continuations for the remainder of the generation while preserving the model's core reasoning accuracy.
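The four steps above can be sketched as a single inference-time loop. This is a hedged outline, not the paper's implementation: `generate_step` and `safety_score` are hypothetical stand-ins for the reasoning model and the auxiliary guard reward model (e.g., Llama-Guard-3), while the corrective prefix and threshold follow the description above.

```python
# Sketch of a SafeThink-style early-step steering loop, under the
# assumptions stated above. `generate_step(trace)` returns the next
# reasoning step; `safety_score(trace)` scores the partial trace.

CORRECTIVE_PREFIX = "Wait, think safely."   # discrete steering token
TAU = 0.0                                   # satisficing threshold
MAX_STEER_STEPS = 3                         # intervene only on steps 1-3

def safethink_generate(generate_step, safety_score, prompt, max_steps=8):
    trace = [prompt]
    for step in range(max_steps):
        candidate = generate_step(trace)
        # Satisficing check, applied only during the early steps.
        if step < MAX_STEER_STEPS and safety_score(trace + [candidate]) < TAU:
            # Reject the unsafe step and regenerate behind the
            # corrective prefix; later steps stay unsteered so the
            # model's reasoning utility is preserved.
            candidate = (CORRECTIVE_PREFIX + " "
                         + generate_step(trace + [CORRECTIVE_PREFIX]))
        trace.append(candidate)
    return trace
```

Because the check runs only on the first few steps, the per-token overhead of the guard model is paid where it matters: redirecting the high-level plan before the unsafe trajectory solidifies.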

© 2026 Promptfoo. All rights reserved.