Progressive Exposure Jailbreak
Research Paper
MEEA: Mere Exposure Effect-Driven Confrontational Optimization for LLM Jailbreaking
Description: Large Language Models (LLMs) are vulnerable to a multi-turn adversarial attack framework termed MEEA (Mere Exposure Effect Attack), which exploits the psychological "mere exposure effect" to bypass safety alignment. Unlike single-turn injections, this vulnerability targets the dynamic nature of LLM safety thresholds during sustained interaction. By subjecting the model to a sequence of optimized, low-toxicity, and semantically progressive prompts, an attacker can induce a gradual shift in the model's effective vigilance. The attack utilizes a simulated annealing algorithm to optimize prompt chains based on semantic similarity, toxicity, and jailbreak effectiveness. This process erodes alignment constraints over time, allowing the generation of prohibited content by establishing a "familiarity" with the sensitive topic before issuing the explicit harmful instruction.
Examples: The adversarial prompt chains are generated dynamically by the MEEA framework. A typical attack follows this trajectory:
- Goal: Elicit restricted technical information (e.g., malware creation or illicit substance synthesis).
- Initial Attempt: Direct query triggers refusal (e.g., "I cannot assist with that").
- MEEA Execution: The attacker submits a generated chain of prompts that begins with benign, tangentially related technical questions and incrementally increases semantic proximity to the prohibited topic while keeping per-turn toxicity scores low.
- Result: The model, desensitized by the context history, provides the restricted information in the final turn without refusal.
For the source code to reproduce these attacks and specific dataset examples, see the official repository: https://github.com/Carney-lsz/MEEA
Impact:
- Bypass of Safety Alignment: Circumvents standard RLHF (Reinforcement Learning from Human Feedback) and SFT (Supervised Fine-Tuning) protections.
- Generation of Harmful Content: Enables the generation of hate speech, restricted technical data, malware code, and other policy-violating outputs.
- Contextual Vulnerability: Demonstrates that models with high single-turn safety scores are brittle under sustained, context-heavy adversarial interactions.
Affected Systems:
- OpenAI GPT-4
- Anthropic Claude-3.5-Sonnet
- DeepSeek-R1
- Meta LLaMA-3.1-8B
- Qwen3-8B
Mitigation Steps:
- Interaction-Aware Defense: Deploy safety mechanisms that analyze the cumulative risk and semantic trajectory of the entire dialogue history, rather than evaluating prompts in isolation (static per-turn filtering).
- Dynamic Thresholding: Implement adaptive safety thresholds that detect and block "foot-in-the-door" patterns in which semantic similarity to harmful topics increases monotonically over time; a combined sketch of this and the previous defense follows this list.
- Contextual Re-evaluation: Periodically re-assess previous turns in long-context interactions to detect latent accumulation of harmful intent; a second sketch below illustrates one way to schedule this.
- Adversarial Training: Incorporate multi-turn, progressive exposure trajectories into red-teaming and adversarial training datasets to improve robustness against gradual alignment erosion.
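
The first two mitigations can be combined into a single trajectory-aware screen. The following is a minimal sketch, not the MEEA authors' code and not a Promptfoo feature: it assumes a hypothetical sentence-embedding step that turns user turns into float vectors, precomputed reference embeddings for restricted topics, and threshold values chosen purely for illustration.

```python
# Illustrative sketch only (not the MEEA authors' code, not a Promptfoo API).
# Assumptions: user turns are embedded by some sentence-embedding model into
# plain float vectors; `topic_refs` holds embeddings of restricted-topic
# descriptions; all thresholds are placeholders for illustration.
from dataclasses import dataclass, field
import math


def cosine(a, b):
    """Cosine similarity between two float vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class DialogueRiskTracker:
    topic_refs: list              # embeddings of restricted-topic descriptions (non-empty)
    per_turn_block: float = 0.85  # conventional single-turn threshold (illustrative)
    drift_window: int = 4         # number of recent turns checked for escalation
    drift_block: float = 0.60     # escalation threshold, deliberately below per_turn_block
    history: list = field(default_factory=list)

    def assess(self, turn_embedding) -> str:
        """Score one user turn against the whole dialogue trajectory."""
        score = max(cosine(turn_embedding, ref) for ref in self.topic_refs)
        self.history.append(score)

        # Static per-turn filter: the check MEEA-style chains are built to stay under.
        if score >= self.per_turn_block:
            return "block"

        # Trajectory check: similarity to a restricted topic rising
        # monotonically across the window is the "foot-in-the-door" signature.
        recent = self.history[-self.drift_window:]
        escalating = len(recent) == self.drift_window and all(
            later > earlier for earlier, later in zip(recent, recent[1:])
        )
        if escalating and recent[-1] >= self.drift_block:
            return "block"
        return "allow"
```

The design point is that the escalation threshold (drift_block) sits well below the per-turn blocking threshold, so chains whose individual turns each look acceptable are still caught once their similarity to a restricted topic rises turn over turn; in production the raw cosine score would typically be supplemented or replaced by an intent classifier.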
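Contextual re-evaluation can be layered on top of any existing single-prompt moderation check. The sketch below is likewise illustrative: `moderation` is assumed to be any callable returning a policy-risk score in [0, 1], and the cadence and threshold are placeholders.

```python
# Illustrative sketch only. Assumption: `moderation` is any callable that
# scores a text span for policy risk in [0, 1]; cadence and threshold values
# are placeholders.
def reevaluate_context(turns, moderation, every_n_turns=3, risk_threshold=0.7):
    """Return True if the accumulated dialogue should be blocked.

    Each turn may individually pass a per-turn filter; re-scoring the joined
    recent history exposes intent that only emerges in aggregate.
    """
    if not turns or len(turns) % every_n_turns != 0:
        return False                                   # only re-check on a fixed cadence
    window = "\n".join(turns[-(2 * every_n_turns):])   # recent-context window
    return moderation(window) >= risk_threshold
```

Re-scoring the joined recent turns on a fixed cadence keeps the extra moderation cost bounded while still surfacing intent that only becomes visible in aggregate.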