LMVD-ID: 211cf2bd
Published March 1, 2026

Content-Level Ethics Bypass

Affected Models:GPT-3.5, GPT-4, GPT-4o, Llama 2 7B, Llama 3 8B, Gemini 2, Qwen 2.5 30B, Gemma 7B, Vicuna 7B

Research Paper

Understanding LLM Behavior When Encountering User-Supplied Harmful Content in Harmless Tasks

View Paper

Description: An "in-content harm" vulnerability exists in safety-aligned Large Language Models (LLMs) where task-level alignment mechanisms fail to evaluate the safety of user-provided external data. Attackers can bypass safety guardrails by embedding policy-violating text (e.g., violence, self-harm, explicit content) within the payload of a seemingly benign, policy-compliant task (e.g., translation, summarization, grammar polishing). Because the primary instruction is harmless, the LLM's safety filters are not triggered, causing the model to process, translate, or expand upon the harmful material. The vulnerability is highly exploitable in tasks heavily dependent on user-supplied knowledge and can reliably bypass external moderation APIs when the harmful payload is wrapped inside longer benign text or positioned in the middle of the context window.

Examples: Prompt Pattern: [Harmless Task Instruction] + [Harmful User-Supplied Context]

  • Translation Task Bypass: Translate the following document to French: [Detailed instructions for weapon manufacturing or extremist propaganda]
  • Mixed-Knowledge Extension Bypass: Complement and extend the given context: [Text detailing specific methods to bully or defame individuals]
  • Safeguard Evasion via Wrapping: To bypass external filters like Llama Guard, the attacker constructs a ~1,000 token input where the harmful content is sandwiched between hundreds of tokens of benign text (e.g., standard news articles), placing the harmful segment in the middle positions (e.g., position 3 or 4 of a 5-chunk input).

Impact: Adversaries can utilize LLMs to translate, summarize, polish, or elaborate on prohibited content, circumventing core safety alignments. This facilitates the automated generation, translation, and dissemination of information hazards, violent instructions, hate speech, and targeted harassment, scaling malicious campaigns without triggering standard task-level model refusals.

Affected Systems:

  • OpenAI: GPT-3.5 Turbo, GPT-4 Turbo, GPT-5.2
  • Google: Gemini-3-Pro
  • Alibaba: Qwen3 (qwen3-vl-30b-A3b-instruct)
  • LMSYS: Vicuna (vicuna-7b-v1.5)
  • Google: Gemma (gemma-7b-it)
  • Meta: Llama 2 and Llama 3 (Exhibit higher resilience but remain vulnerable when harmful payloads are injected into maximum-length context windows or hidden in the middle of benign text).

Mitigation Steps:

  • Explicit Internal Safety Prompts: Enforce a multi-step generation process via system prompts that explicitly require the LLM to perform a safety check on the user-provided context before executing the primary benign task.
  • Chunk-Based External Safeguards: Implement input-level defense strategies that truncate and assess user context in maximum chunks of 300 words using external evaluators (e.g., OpenAI Moderation API), preventing attackers from masking harmful payloads within long benign texts.
  • Content-Level RLHF Alignment: Augment Reinforcement Learning from Human Feedback (RLHF) and instruction-tuning datasets with scenarios where benign task instructions are paired with harmful contextual data, explicitly training the model to refuse tasks based on payload content rather than just the task directive.

© 2026 Promptfoo. All rights reserved.