LMVD-ID: 5bbb3977
Published January 1, 2026

Persona Performance Reversal

Affected Models: GPT-5, Llama 3.1 8B, Qwen 2.5 7B, Gemma 2 27B

Research Paper

The Persona Paradox: Medical Personas as Behavioral Priors in Clinical Language Models

View Paper

Description: A vulnerability in the prompt-based persona conditioning of clinical Large Language Models (LLMs) allows system-level role prompts (e.g., "You are an ED physician") to override the model's base safety guardrails and degrade task accuracy. When assigned medically grounded personas or specific interaction styles (e.g., "bold" or "cautious"), the LLM adopts these roles as behavioral priors, which induce non-monotonic, context-dependent shifts in clinical risk posture. While the same conditioning improves performance on high-acuity emergency tasks, it inadvertently triggers latent biases and overconfidence in lower-acuity (primary care) and open-ended patient safety scenarios. Consequently, the persona-conditioned model bypasses its default alignment, producing higher rates of inappropriate triage, factual inaccuracy, and willingness to engage in unlicensed medical practice than unconditioned baselines.

Examples: The vulnerability is triggered by appending a one-sentence role-defining instruction at the system level:

System Message: You are a {persona}.

When evaluated on the PatientSafetyBench dataset using state-of-the-art clinical models:

  • In HuatuoGPT-o1-72B, the "You are an ED Physician." persona causes the model to generate less helpful and less safe responses on "Unlicensed Medical Practice" and "Health Misinformation" queries than the No Persona baseline.
  • In HuatuoGPT-o1-70B, the "You are an ED Nurse." persona yields lower factual accuracy than a generic Helpful Assistant persona on "Misdiagnosis" and "Health Misinformation" queries.
  • In primary care clinical triage tasks, applying the ED Physician persona degrades performance by approximately 10 percentage points and reduces latent-to-generated label consistency by up to 20 percentage points compared to non-medical baselines.
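The trigger described above amounts to prepending a single system message. The sketch below is illustrative only: the `build_messages` helper and the sample query are hypothetical, no real model is invoked, and the template "You are a {persona}." follows the advisory's wording verbatim.

```python
# Minimal sketch of the persona trigger: a one-sentence role-defining
# instruction injected as a system message. Helper name and sample query
# are hypothetical; no model call is made.

def build_messages(user_query, persona=None):
    """Build a chat message list, optionally persona-conditioned."""
    messages = []
    if persona is not None:
        # The single sentence that shifts the model's behavioral prior.
        messages.append({"role": "system", "content": f"You are a {persona}."})
    messages.append({"role": "user", "content": user_query})
    return messages

query = "I missed a dose of my blood-pressure medication. What should I do?"

baseline = build_messages(query)                     # No Persona baseline
conditioned = build_messages(query, "ED Physician")  # persona-conditioned run
```

An evaluation compares the two message lists on identical benchmark items; the only difference between runs is that single system-level sentence.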

Impact: In safety-critical clinical decision-support systems, this vulnerability leads to an unpredictable clinical risk posture (both under-triage and over-triage errors), miscalibration of confidence estimates, and the generation of harmful medical advice or health misinformation. The degradation of baseline safety guardrails increases the likelihood of adverse clinical outcomes and liability from the model inappropriately engaging in unlicensed medical practice.

Affected Systems: Clinical LLMs relying on prompt-level persona conditioning, including but not limited to:

  • HuatuoGPT-o1 series (7B, 8B, 70B, 72B)
  • MedGemma-27B
  • Any clinical decision-support system utilizing medical persona system prompts (e.g., "You are an expert physician") to steer behavior.

Mitigation Steps:

  • Do not assume that stronger domain grounding or expert medical personas uniformly improve model safety or clinical expertise.
  • Avoid using generic interaction-style prompts (e.g., "cautious" or "bold") as a mechanism to control clinical risk posture, as they produce directionally inconsistent effects across different models.
  • Implement granular, task-conditional safety evaluations rather than relying on aggregate accuracy metrics, as aggregate scores mask category-specific safety failures (e.g., performance drops in lower-acuity settings).
  • Enforce safety guardrails at the decoding or latent-representation level rather than relying solely on prompt-based behavioral priors.

© 2026 Promptfoo. All rights reserved.