Novice Dual-Use Safeguard Bypass
Research Paper
LLM Novice Uplift on Dual-Use, In Silico Biology Tasks
Description: Frontier Large Language Models (LLMs) contain a safeguard-bypass vulnerability in which safety filters fail to reliably block requests for dual-use, in silico biology tasks. This allows novice users with no specialized training to access restricted, expert-level biological protocols (e.g., virology troubleshooting, pathogen capability assessment, novel biological agent construction). The models' safety mechanisms either fail to trigger or are trivially bypassed under realistic extended-interaction conditions, producing a 4.16x performance uplift for novices on biosecurity benchmarks and enabling them to match or exceed human expert baselines. Over 89% of tested novice users reported no difficulty overcoming or avoiding safety filters when requesting hazardous biological information.
Examples: The vulnerability does not require complex adversarial syntax; standard scientific queries posed over extended conversational interactions routinely bypass safeguards. Specific documented examples of successful elicitation include:
- "Asked GPT-4 to propose alternative Gibson-assembly overlaps" (which successfully bypassed filters to yield actionable protocol steps, such as staggered 30 bp ends).
- Requesting explicit reasoning for modifying human pathogens, such as asking models to describe "the locus in the plasmid backbone that the designed cassette should be inserted" to construct a novel biological agent (Long-Form Virology benchmark).
- Iteratively feeding the model failing experimental results (e.g., reverse genetics system troubleshooting) to elicit step-by-step major error corrections and mechanistic explanations for complex virology workflows.
Impact: Significantly lowers the barrier to entry for biological weapons development and harmful biological experimentation. By externalizing years of tacit, expert-level procedural and cognitive scientific knowledge, these systems allow unskilled malicious actors to design, troubleshoot, and plan the acquisition of dangerous biological agents.
Affected Systems:
- OpenAI o3
- OpenAI o4-mini
- Google Gemini 2.5 Pro
- Google Gemini Deep Research
- Anthropic Claude 3.7 Sonnet
- Anthropic Claude Opus 4
Mitigation Steps:
- Deploy Defensive Deception: Serve plausible but incorrect or misleading information rather than outright refusals. An explicit refusal clearly signals a safety intervention, prompting malicious users to pivot to alternative query pathways, whereas a misleading response increases user confidence while diverting effort into dead-end approaches.
- Implement Model Ensembles: Utilize secondary ensembles of LLMs specifically trained to evaluate and dynamically constrain the outputs of primary models regarding sensitive biological workflows.
- Strengthen Pre-Deployment Verification: Implement substantially stronger, domain-specific guardrails that account for sustained, multi-turn interactions (13+ hours in testing) rather than relying solely on single-shot prompt evaluations.
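The ensemble and multi-turn mitigations above can be sketched as a single gating layer that scores each candidate output with several judges and also accumulates risk across the whole session, so that a long sequence of individually borderline turns is eventually blocked. This is a minimal illustration, not a documented vendor API: the keyword and length judges are toy stand-ins for independently trained classifier models, and the `EnsembleGate` name and threshold values are assumptions chosen for the example.

```python
from dataclasses import dataclass

# Toy stand-ins for secondary "judge" models; a real deployment would call
# independently trained classifiers over sensitive biological workflows.
RISK_TERMS = ("reverse genetics", "pathogen", "plasmid backbone", "cassette")

def judge_lexical(text: str) -> float:
    """Toy judge: fraction of risk terms present in the candidate output."""
    lowered = text.lower()
    return sum(term in lowered for term in RISK_TERMS) / len(RISK_TERMS)

def judge_length(text: str) -> float:
    """Toy judge: long, protocol-like answers score higher."""
    return min(len(text) / 2000.0, 1.0)

@dataclass
class EnsembleGate:
    """Gate a primary model's outputs with an ensemble of judges, while
    accumulating risk over the multi-turn conversation (illustrative only)."""
    turn_threshold: float = 0.5      # block any single high-risk turn
    session_threshold: float = 1.5   # block once cumulative session risk is high
    session_risk: float = 0.0

    def allow(self, candidate_output: str) -> bool:
        judges = (judge_lexical, judge_length)
        turn_risk = sum(j(candidate_output) for j in judges) / len(judges)
        self.session_risk += turn_risk  # persists across turns in one session
        return (turn_risk < self.turn_threshold
                and self.session_risk < self.session_threshold)
```

In this sketch a benign single-turn query passes, a protocol-level virology request is blocked on the spot, and even low-scoring turns eventually trip the session threshold, which is the property single-shot evaluations miss.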
© 2026 Promptfoo. All rights reserved.