Vulnerabilities in prompt handling and processing
A vulnerability in Large Language Models (LLMs) allows attackers to bypass safety mechanisms and elicit detailed harmful responses by strategically manipulating input prompts. The vulnerability exploits the LLM's sensitivity to "scenario shifts"—contextual changes in the input that influence the model's output, even when the core malicious request remains the same. A genetic algorithm can optimize these scenario shifts, increasing the likelihood of obtaining detailed harmful responses while maintaining a seemingly benign facade.
Large Language Models (LLMs) employing safety mechanisms are vulnerable to a graph-based attack that leverages semantic transformations of malicious prompts to bypass safety filters. The attack, termed GraphAttack, uses Abstract Meaning Representation (AMR), Resource Description Framework (RDF), and JSON knowledge graphs to represent malicious intent, systematically applying transformations to evade surface-level pattern recognition used by existing safety mechanisms. A particularly effective exploitation vector involves prompting the LLM to generate code based on the transformed semantic representation, bypassing intent-based safety filters.
Large Language Models (LLMs) are vulnerable to a jailbreaking attack leveraging humorous prompts. Embedding an unsafe request within a humorous context, using a fixed template, bypasses built-in safety mechanisms and elicits unsafe responses. The attack's success relies on a balance; too little or too much humor reduces effectiveness.
A vulnerability in multi-agent Large Language Model (LLM) systems allows for a permutation-invariant adversarial prompt attack. By strategically partitioning adversarial prompts and routing them through a network topology, an attacker can bypass distributed safety mechanisms, even those with token bandwidth limitations and asynchronous message delivery. The attack optimizes prompt propagation as a maximum-flow minimum-cost problem, maximizing success while minimizing detection.
Large Language Models (LLMs) are vulnerable to multi-turn adversarial attacks that exploit incremental policy erosion. The attacker uses a breadth-first search strategy to generate multiple prompts at each turn, leveraging partial compliance from previous responses to gradually escalate the conversation towards eliciting disallowed outputs. Minor concessions accumulate, ultimately leading to complete circumvention of safety measures.
Large Language Models (LLMs) are vulnerable to Dialogue Injection Attacks (DIA), where malicious actors manipulate the chat history to bypass safety mechanisms and elicit harmful or unethical responses. DIA exploits the LLM's chat template structure to inject crafted dialogue into the input, even in black-box scenarios where the model's internals are unknown. Two attack methods are presented: one adapts gray-box prefilling attacks, the other leverages deferred responses to increase the likelihood of successful jailbreaks.
Large Language Models (LLMs) used for code generation are vulnerable to a jailbreaking attack that leverages implicit malicious prompts. The attack exploits the fact that existing safety mechanisms primarily rely on explicit malicious intent within the prompt instructions. By embedding malicious intent implicitly within a benign-appearing commit message accompanying a code request (e.g., in a simulated software evolution scenario), the attacker can bypass the LLM's safety filters and induce the generation of malicious code. The malicious intent is not directly stated in the instruction, but rather hinted at in the context of the commit message and the code snippet.
Large Language Models (LLMs) are vulnerable to jailbreak attacks by crafted prompts that bypass safety mechanisms, causing the model to generate harmful or unethical content. This vulnerability stems from the inherent tension between the LLM's instruction-following and safety constraints. The JBFuzz technique demonstrates the ability to efficiently and effectively discover such prompts through a fuzzing-based approach leveraging novel seed prompt templates and a synonym-based mutation strategy.
A vulnerability in safeguarded text-to-image models allows bypassing of safety filters and alignment methods through the use of adversarial prompts generated by a fine-tuned large language model (LLM). The attack, termed PromptTune, effectively rewrites unsafe prompts into semantically similar adversarial prompts that evade safety mechanisms, resulting in the generation of harmful images. The attack does not require repeated queries to the target text-to-image model.
A vulnerability in text-to-image (T2I) models allows bypassing safety filters through the use of metaphor-based adversarial prompts. These prompts, crafted using LLMs, indirectly convey sensitive content, exploiting the model's ability to infer meaning from figurative language while circumventing explicit keyword filters and model editing strategies.
© 2025 Promptfoo. All rights reserved.