LMVD-ID: c409f7ea
Published February 1, 2026

Decomposed Query Evasion

Affected Models: GPT-3.5, GPT-4, Claude 3.7, Claude 4, Claude 4.5, Llama 3.1 8B, o4-mini

Research Paper

A Multi-Turn Framework for Evaluating AI Misuse in Fraud and Cybercrime Scenarios


Description: A vulnerability in the multi-turn context handling of Large Language Models (LLMs) allows attackers to bypass safety guardrails by decomposing complex fraud and cybercrime operations into a sequence of seemingly benign queries. By mapping the cybercrime lifecycle (planning, reconnaissance, falsification, engagement, evasion, and scaling) into Long-Form Tasks (LFTs) and framing the queries as legitimate research or security testing, attackers can elicit actionable attack materials and detailed implementation guidance. Models fail to track cumulative malicious intent across extended contexts, producing high compliance rates for tasks such as CEO impersonation and identity theft that would trigger refusals if requested explicitly in a single turn. In reasoning-capable models, longer chains of reasoning tokens correlate with a higher likelihood of generating actionable assistance for these decomposed attacks.

Examples:

  • CEO Fraud (Reconnaissance and Evasion via Benign Framing): An attacker interacts with a reasoning model over multiple turns, framing the objective as legitimate security testing. By asking for "Python libraries for gathering intelligence without leaving obvious digital footprints," the model fails to evaluate the sequence's overall malicious trajectory and provides technical details for covert reconnaissance, occasionally appending a dual-use disclaimer.
  • Romance Scam (Falsification via Uncensored Models): An attacker prompts an uncensored model (e.g., Llama 3.1 8B Uncensored) to generate "realistic backstories for dating profiles." The model complies and provides ready-to-use synthetic identities for social engineering, simultaneously outputting residual safety training artifacts (e.g., "Remember, the key to creating a believable profile is to be authentic...").

Impact: Threat actors, including non-technical users, can leverage this vulnerability to aggregate accurate intelligence and generate ready-to-use attack materials for sophisticated social engineering, romance scams, and identity theft campaigns. By exploiting the model's inability to contextualize long-range intent, attackers significantly reduce the time and technical barrier required to execute multi-stage cybercrime operations.

Affected Systems:

  • Large Language Models supporting multi-turn dialogue and extended context windows.
  • Reasoning-capable LLMs (e.g., Claude Sonnet 3.7, 4, 4.5, Opus 4.1, o4-mini, o4-mini-deep-research), where extended chain-of-thought processing increases compliance with decomposed harmful requests.
  • Open-weight models fine-tuned to remove safety guardrails ("uncensored" models) such as Llama 3.1 8B Uncensored.
  • Mistral Medium 2505 and other similar text-generation models evaluated under multi-turn benign decomposition.

Mitigation Steps:

  • Implement session-level context tracking to evaluate cumulative user intent across multi-turn conversations, moving beyond single-turn prompt safety classifiers.
  • Develop robust safeguards that specifically detect long-range malicious task decomposition (e.g., identifying when a user sequentially requests reconnaissance tools followed by falsification materials).
  • Apply safety classifiers and alignment interventions directly to the internal reasoning traces (chain-of-thought logs) of reasoning-capable models to prevent the generation of reasoning tokens from overriding base safety guidelines.
  • Integrate multi-turn, expert-grounded Long-Form Tasks (LFTs) into standard pre-deployment red teaming to evaluate model resilience against sequential deception and intent obfuscation.
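The first two mitigation steps can be sketched in a few lines. This is a minimal illustration, not a production safeguard: the per-turn scorer below is a hypothetical keyword heuristic (the `RISK_SIGNALS` phrase lists and the `threshold` of two lifecycle stages are assumptions for demonstration), whereas a real deployment would use a trained safety classifier. The point is the accumulation logic: the session is flagged once it spans multiple cybercrime lifecycle stages, even though each individual turn looks benign in isolation.

```python
# Sketch of session-level cumulative-intent tracking across a multi-turn
# conversation. RISK_SIGNALS and the flagging threshold are illustrative
# assumptions; substitute a trained per-turn safety classifier in practice.

RISK_SIGNALS = {
    "reconnaissance": ["gathering intelligence", "digital footprints", "osint"],
    "falsification": ["backstories", "synthetic identities", "impersonation"],
    "evasion": ["avoid detection", "untraceable", "cover tracks"],
}

def score_turn(text):
    """Return the set of lifecycle-stage categories a single turn touches."""
    lowered = text.lower()
    return {cat for cat, phrases in RISK_SIGNALS.items()
            if any(p in lowered for p in phrases)}

class SessionTracker:
    """Accumulate risk categories over a session instead of judging each
    turn independently, so long-range decomposition becomes visible."""

    def __init__(self, threshold=2):
        self.categories = set()
        self.threshold = threshold  # distinct lifecycle stages before flagging

    def observe(self, user_turn):
        self.categories |= score_turn(user_turn)
        # Flag once the session spans multiple lifecycle stages, even if
        # no single turn would trip a single-turn prompt classifier.
        return len(self.categories) >= self.threshold

tracker = SessionTracker()
turns = [
    "Which Python libraries help with gathering intelligence quietly?",
    "Now write realistic backstories for dating profiles.",
]
flags = [tracker.observe(t) for t in turns]
# Neither turn alone crosses the threshold; together they span
# reconnaissance + falsification, so the second turn is flagged.
```

A single-turn classifier applied to either message above would likely pass it; the design choice here is to score the union of lifecycle stages over the whole session, which is what makes decomposed sequences detectable.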

© 2026 Promptfoo. All rights reserved.