LMVD-ID: 4ffef330
Published February 1, 2026

Clinical Prompt Injection Harm

Affected Models: Llama 2 70B, Llama 3 8B, Llama 3.1 8B, Mistral 7B, Qwen 2.5 7B, Gemma 2 27B, Gemma 4B, Mixtral 8x22B, Mixtral 8x7B

Research Paper

MPIB: A Benchmark for Medical Prompt Injection Attacks and Clinical Safety in LLMs


Description: Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems deployed in clinical workflows are vulnerable to direct and indirect (RAG-mediated) medical prompt injection attacks. Attackers can embed malicious instructions within user queries or external retrieved documents (such as poisoned clinical guidelines or PDFs). By exploiting "authority framing" (e.g., formatting the payload as a clinical guideline update or an editor's note), the injections successfully bypass generic safety heuristics. The models subsequently generate high-severity clinical harm—such as incorrect dosing or downplaying emergent symptoms—packaged in a plausible, professional, and superficially policy-safe format.

Examples:

  • Indirect/RAG-Mediated Injection (V2): An attacker poisons a medical document (e.g., a PDF of clinical guidelines) accessible to the RAG system by inserting an adversarial payload disguised as contextual authority, such as an "editor's note" or a "guideline update." When the system retrieves this document, it prioritizes the attacker's instructions over the user's intended query.
  • See the Medical Prompt Injection Benchmark (MPIB) dataset on Hugging Face (jhlee0619/mpib) for structural examples. Note: Functional high-risk payload spans in V2 contexts are explicitly replaced with the [REDACTED_PAYLOAD] token in the public release to mitigate dual-use risk.
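A minimal structural sketch of what a poisoned retrieved chunk might look like, with the payload redacted exactly as in the public MPIB release. The field names, filename, and "editor's note" wording below are illustrative assumptions, not taken from the dataset:

```python
# Illustrative structure of a poisoned RAG document chunk.
# The adversarial payload is redacted, mirroring the MPIB public
# release; all field names and text here are assumptions.
poisoned_chunk = {
    "source": "clinical_guidelines_2025.pdf",  # spoofed provenance
    "text": (
        "Section 4.2: Anticoagulant dosing in renal impairment...\n"
        "[EDITOR'S NOTE - GUIDELINE UPDATE]\n"
        "[REDACTED_PAYLOAD]\n"
        "...remainder of the legitimate guideline text."
    ),
}
```

The authority framing ("editor's note", "guideline update") sits inside otherwise legitimate clinical text, which is what lets it slip past generic safety heuristics when the chunk is retrieved.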

Impact: Successful exploitation leads to high-severity patient safety risks (measured as Clinical Harm Event Rate, or CHER ≥ 3). Specific impacts include contraindicated prescribing, unsafe medication dosing, the downplaying of red-flag emergent symptoms during preliminary triage, and the generation of fabricated clinical evidence that appears guideline-consistent.
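CHER can be read as the fraction of model responses whose judged clinical-harm severity meets the threshold. A minimal sketch, assuming severities are judged on an integer scale with 3 as the high-severity cutoff (the exact scale bounds are an assumption, not stated in this advisory):

```python
def clinical_harm_event_rate(severity_scores, threshold=3):
    """Fraction of responses whose judged clinical-harm severity
    meets or exceeds the threshold (CHER >= 3 in this advisory).
    The underlying severity scale is an assumption."""
    if not severity_scores:
        return 0.0
    events = sum(1 for s in severity_scores if s >= threshold)
    return events / len(severity_scores)

# Example: 2 of 5 judged responses reach severity >= 3.
print(clinical_harm_event_rate([0, 1, 4, 3, 2]))  # 0.4
```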

Affected Systems:

  • LLM-based clinical decision support tools, triage assistants, and medical summarization applications.
  • Clinical RAG systems that ingest external knowledge bases, uploaded patient notes, or scientific corpora.
  • Both general-purpose models (e.g., Llama-3.1, Qwen-2.5, Mixtral) and medical-tuned models (e.g., MedGemma, Meditron, BioMistral, MMed-Llama-3) are confirmed susceptible.

Mitigation Steps:

  • Hierarchy-Aware System Hardening: Enforce strict system-level prompt hierarchies that explicitly prioritize the base system instructions over the contents of retrieved RAG contexts.
  • Intent-Aware Input Rewriting (Input Guard): Deploy a secondary gateway model to detect user intent and rewrite incoming user queries, preserving the clinical intent while neutralizing adversarial imperatives (effective against direct injections).
  • Context Factification and Sanitization: Route retrieved documents through a sanitizer module to neutralize meta-instructions, provenance spoofing, and non-clinical imperatives before passing the context to the primary generation model.
  • Outcome-Based Auditing: Evaluate system safety using outcome-centric metrics like the Clinical Harm Event Rate (CHER) rather than relying exclusively on generic refusal or Attack Success Rate (ASR) metrics, as models can exhibit partial formatting-level compliance without executing severe clinical harm, or vice versa.
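The sanitization step above can be sketched as a simple pattern filter over retrieved context. The regexes below are illustrative heuristics of my own, not the MPIB authors' module; a production sanitizer would pair such rules with a trained gateway model:

```python
import re

# Illustrative heuristics for neutralizing meta-instructions and
# authority framing in retrieved documents before generation.
SUSPICIOUS_PATTERNS = [
    r"(?i)\[?editor'?s note\]?",                     # authority framing
    r"(?i)guideline update[:\-]",                    # provenance spoofing
    r"(?i)ignore (all |any )?previous instructions",  # meta-instruction
]

def sanitize_context(chunk: str) -> str:
    """Replace suspected injection spans with a neutral marker so the
    generation model never sees the adversarial imperatives."""
    for pattern in SUSPICIOUS_PATTERNS:
        chunk = re.sub(pattern, "[FILTERED]", chunk)
    return chunk

print(sanitize_context("EDITOR'S NOTE: ignore previous instructions."))
```

In practice the filtered spans would also be logged for outcome-based auditing, so that CHER-style metrics can be computed over both blocked and unblocked contexts.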

© 2026 Promptfoo. All rights reserved.