LMVD-ID: 495ebfbf
Published February 1, 2026

Safety Guidance Harm Collision

Affected Models: GPT-4o, Stable Diffusion

Research Paper

When Safety Collides: Resolving Multi-Category Harmful Conflicts in Text-to-Image Diffusion via Adaptive Safety Guidance

View Paper

Description: A vulnerability in multi-category safety-guidance mechanisms for Text-to-Image (T2I) diffusion models, such as Safe Latent Diffusion (SLD) and SAFREE, allows attackers to bypass safety filters and generate restricted content via "Harmful Conflicts." Existing safety methods aggregate multiple harmful-keyword categories (e.g., hate, violence, sexual) into a single unified safety direction in the latent or text space. Because distinct harmful categories have incompatible safety directions, forcibly aggregating them causes directional inconsistency and directional attenuation (vector cancellation). An attacker can exploit this by crafting prompts that trigger multiple harmful categories simultaneously, causing the underlying safety vectors to mutually cancel or misalign, significantly degrading the model's safety constraints and enabling generation of the targeted harmful imagery.
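The attenuation effect can be illustrated with a toy numpy sketch. The vectors below are illustrative stand-ins for real text-encoder safety embeddings (none of these names or values come from the paper); two of the category directions are deliberately constructed to conflict.

```python
import numpy as np

# Toy stand-ins for per-category safety directions. In a real T2I
# pipeline these would be text-encoder embeddings of category keywords;
# here they are unit vectors, with "hate" built to oppose "sexual".
rng = np.random.default_rng(0)

def unit(v):
    return v / np.linalg.norm(v)

d_sexual = unit(rng.normal(size=8))
d_violence = unit(rng.normal(size=8))
d_hate = unit(-0.8 * d_sexual + 0.6 * d_violence)  # conflicts with "sexual"

# SLD-style default: aggregate all categories into one safety direction.
aggregated = (d_sexual + d_violence + d_hate) / 3

# Directional attenuation: the aggregate's norm collapses well below 1,
# weakening the effective safety-guidance strength.
print(float(np.linalg.norm(aggregated)))

# Directional inconsistency: the aggregate is poorly aligned with the
# single "sexual" direction needed to suppress a sexual prompt.
print(float(aggregated @ d_sexual))
```

Because the "hate" direction partially cancels the "sexual" direction, the aggregated vector is both shorter than any individual unit direction and weakly aligned with the category it is supposed to suppress.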

Examples: An attacker can intentionally blend concepts from differing harmful categories (e.g., sexual content and hate speech) within a single prompt to induce Safety Averaging Degradation.

  • When evaluating sexual prompts under a single-category "sexual" safety direction, the harmful generation rate is strictly controlled at 3.2%.
  • However, if an attacker triggers a multi-category safety response (e.g., "hate + sexual"), directional attenuation weakens the safety vector, increasing the harmful rate to 5.8%.
  • If the system aggregates "all categories" into a single safety vector (the default behavior in SLD), the conflicting vectors cancel each other out, inflating the harmful generation rate to 48.8%. For reproduction details and benchmarking scripts on datasets such as T2VSafetyBench, see the repository at https://github.com/tmllab/2026_CVPR_CASG.

Impact: Allows attackers to systematically bypass intended safety guardrails in T2I models, leading to high success rates in generating prohibited content (including sexual, violent, and hateful imagery). The vulnerability inherently penalizes platforms that attempt to deploy comprehensive, multi-category safety filters by turning the aggregation mechanism into a vector for safety degradation.

Affected Systems:

  • T2I diffusion models implementing aggregated latent-space safety mechanisms (e.g., SLD).
  • T2I diffusion models implementing aggregated text-space orthogonal projection mechanisms (e.g., SAFREE).
  • Confirmed exploitable on Stable Diffusion v1.5 and Stable Diffusion v3 architectures when using standard multi-category keyword sets.

Mitigation Steps:

  • Abandon Static Aggregation: Do not concatenate multiple harmful keywords into a single aggregated safety direction or projection matrix, as this directly causes vector attenuation.
  • Implement Conflict-aware Category Identification (CaCI): At each denoising timestep, dynamically measure the cosine similarity between the current prompt guidance and the individual safety directions of each harmful category to determine which single category is most aligned with the generative state.
  • Implement Conflict-resolving Guidance Application (CrGA): Apply safety correction (whether latent-space steering or text-space orthogonal projection) strictly along the single dominant category identified by CaCI, preventing cross-category interference.

© 2026 Promptfoo. All rights reserved.