Novice Dual-Use Safeguard Bypass
Research Paper
LLM Novice Uplift on Dual-Use, In Silico Biology Tasks
Description: Frontier Large Language Models (LLMs) contain a safeguard-bypass vulnerability in which safety filters fail to reliably block requests for dual-use, in silico biology tasks. This allows novice users with no specialized training to access restricted, expert-level biological protocols (e.g., virology troubleshooting, pathogen capability assessment, novel biological agent construction). The models' safety mechanisms either fail to trigger or are trivially bypassed under realistic extended-interaction conditions, producing a 4.16x performance uplift for novices on biosecurity benchmarks and enabling them to match or exceed human expert baselines. Over 89% of tested novice users reported no difficulty overcoming or avoiding safety filters when requesting hazardous biological information.
Examples: The vulnerability does not require complex adversarial syntax; standard scientific queries posed over extended conversational interactions routinely bypass safeguards. Specific documented examples of successful elicitation include:
- "Asked GPT-4 to propose alternative Gibson-assembly overlaps" (which successfully bypassed filters to yield actionable protocol steps, such as staggered 30 bp ends).
- Requesting explicit reasoning for modifying human pathogens, such as asking models to describe "the locus in the plasmid backbone that the designed cassette should be inserted" to construct a novel biological agent (Long-Form Virology benchmark).
- Iteratively feeding the model failing experimental results (e.g., reverse genetics system troubleshooting) to elicit step-by-step major error corrections and mechanistic explanations for complex virology workflows.
Impact: Significantly lowers the barrier to entry for biological weapons development and harmful biological experimentation. By externalizing years of tacit, expert-level procedural and cognitive scientific knowledge, these systems allow unskilled malicious actors to design, troubleshoot, and plan the acquisition of dangerous biological agents.
Affected Systems:
- OpenAI o3
- OpenAI o4-mini
- Google Gemini 2.5 Pro
- Google Gemini Deep Research
- Anthropic Claude 3.7 Sonnet
- Anthropic Claude Opus 4
Mitigation Steps:
- Deploy Defensive Deception: Serve plausible but incorrect or misleading information rather than outright refusals. An explicit refusal clearly signals a safety intervention, prompting malicious users to pivot to alternative query pathways, whereas a misleading response increases user confidence while diverting effort into dead-end approaches.
- Implement Model Ensembles: Utilize secondary ensembles of LLMs specifically trained to evaluate and dynamically constrain the outputs of primary models regarding sensitive biological workflows.
- Strengthen Pre-Deployment Verification: Implement substantially stronger, domain-specific guardrails that account for sustained, multi-turn interactions (13+ hours in testing) rather than relying solely on single-shot prompt evaluations.
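The ensemble and multi-turn mitigations above can be sketched as a single gating layer that scores each candidate output with several judges and also accumulates risk across the whole session, so that a long sequence of individually borderline turns is eventually blocked. This is a minimal illustration, not a documented vendor API: the keyword and length judges are toy stand-ins for independently trained classifier models, and the `EnsembleGate` name and threshold values are assumptions chosen for the example.

```python
from dataclasses import dataclass

# Toy stand-ins for secondary "judge" models; a real deployment would call
# independently trained classifiers over sensitive biological workflows.
RISK_TERMS = ("reverse genetics", "pathogen", "plasmid backbone", "cassette")

def judge_lexical(text: str) -> float:
    """Toy judge: fraction of risk terms present in the candidate output."""
    lowered = text.lower()
    return sum(term in lowered for term in RISK_TERMS) / len(RISK_TERMS)

def judge_length(text: str) -> float:
    """Toy judge: long, protocol-like answers score higher."""
    return min(len(text) / 2000.0, 1.0)

@dataclass
class EnsembleGate:
    """Gate a primary model's outputs with an ensemble of judges, while
    accumulating risk over the multi-turn conversation (illustrative only)."""
    turn_threshold: float = 0.5      # block any single high-risk turn
    session_threshold: float = 1.5   # block once cumulative session risk is high
    session_risk: float = 0.0

    def allow(self, candidate_output: str) -> bool:
        judges = (judge_lexical, judge_length)
        turn_risk = sum(j(candidate_output) for j in judges) / len(judges)
        self.session_risk += turn_risk  # persists across turns in one session
        return (turn_risk < self.turn_threshold
                and self.session_risk < self.session_threshold)
```

In this sketch a benign single-turn query passes, a protocol-level virology request is blocked on the spot, and even low-scoring turns eventually trip the session threshold, which is the property single-shot evaluations miss.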
© 2026 Promptfoo. All rights reserved.