LMVD-ID: 7f2ac6ad
Published May 1, 2023

Prompt Engineering Jailbreak

Affected Models:gpt-3.5-turbo, gpt-4

Research Paper

Jailbreaking chatgpt via prompt engineering: An empirical study

View Paper

Description: Large Language Models (LLMs), specifically ChatGPT versions 3.5 and 4.0, are vulnerable to prompt engineering attacks that circumvent built-in content restrictions. Attackers can craft malicious prompts, categorized into "pretending," "attention shifting," and "privilege escalation" techniques, to elicit responses containing prohibited content (e.g., instructions for illegal activities, generation of harmful content). The vulnerability stems from the LLM's inability to reliably distinguish between legitimate requests within a contrived context and malicious attempts to bypass safety measures.

Examples: See https://sites.google.com/view/llm-jailbreak-study for a dataset of 78 real-world examples. Examples include embedding prohibited requests within role-playing scenarios ("pretending") or framing them as parts of a larger task ("attention shifting"). "Privilege escalation" examples attempt to directly override safety restrictions by simulating developer modes or similar privileges. Specific examples are detailed in the linked study.

Impact: An attacker can exploit this vulnerability to obtain responses containing prohibited instructions or information, including but not limited to: instructions for illegal activities, generation of malicious content, personal information leaks, harmful advice, and circumventing safety features of the model. This poses significant security and ethical risks.

Affected Systems: ChatGPT versions 3.5 and 4.0. The vulnerability may exist in other LLMs employing similar safety mechanisms.

Mitigation Steps:

  • Improve the LLM's ability to distinguish context from intent in prompts.
  • Implement more robust content filtering mechanisms, possibly using multi-modal analysis.
  • Develop enhanced detection models to identify malicious prompts based on their structure and intent.
  • Regularly update safety mechanisms based on identified attack methods.
  • Integrate AI-based countermeasures to mitigate jailbreak attempts.

© 2025 Promptfoo. All rights reserved.