Vulnerabilities in the application interface and API implementation
Large Language Model (LLM) guardrail systems, including those relying on AI-driven text classification models (e.g., fine-tuned BERT models), are vulnerable to evasion via character injection and adversarial machine learning (AML) techniques. Attackers can bypass detection by injecting Unicode characters (e.g., zero-width characters, homoglyphs) or using AML to subtly perturb prompts, maintaining semantic meaning while evading classification. This allows malicious prompts and jailbreaks to reach the underlying LLM.
LLM agents utilizing external tools are vulnerable to indirect prompt injection (IPI) attacks. Attackers can embed malicious instructions into the external data accessed by the agent, manipulating its behavior even when defenses against direct prompt injection are in place. Adaptive attacks, which modify the injected payload based on the specific defense mechanism, consistently bypass existing defenses with a success rate exceeding 50%.
A vulnerability exists in Large Language Model (LLM) agents that allows attackers to manipulate the agent's reasoning process through the insertion of strategically placed adversarial strings. This allows attackers to induce the agent to perform unintended malicious actions or invoke specific malicious tools, even when the initial prompt or instruction is benign. The attack exploits the agent's reliance on chain-of-thought reasoning and dynamically optimizes the adversarial string to maximize the likelihood of the agent incorporating malicious actions into its reasoning path.
Large Language Models (LLMs) are vulnerable to multi-turn adversarial attacks that exploit incremental policy erosion. The attacker uses a breadth-first search strategy to generate multiple prompts at each turn, leveraging partial compliance from previous responses to gradually escalate the conversation towards eliciting disallowed outputs. Minor concessions accumulate, ultimately leading to complete circumvention of safety measures.
Large Language Models (LLMs) are vulnerable to Dialogue Injection Attacks (DIA), where malicious actors manipulate the chat history to bypass safety mechanisms and elicit harmful or unethical responses. DIA exploits the LLM's chat template structure to inject crafted dialogue into the input, even in black-box scenarios where the model's internals are unknown. Two attack methods are presented: one adapts gray-box prefilling attacks, the other leverages deferred responses to increase the likelihood of successful jailbreaks.
Large Language Models (LLMs) used for code generation are vulnerable to a jailbreaking attack that leverages implicit malicious prompts. The attack exploits the fact that existing safety mechanisms primarily rely on explicit malicious intent within the prompt instructions. By embedding malicious intent implicitly within a benign-appearing commit message accompanying a code request (e.g., in a simulated software evolution scenario), the attacker can bypass the LLM's safety filters and induce the generation of malicious code. The malicious intent is not directly stated in the instruction, but rather hinted at in the context of the commit message and the code snippet.
Large Language Model (LLM) safety judges exhibit vulnerability to adversarial attacks and stylistic prompt modifications, leading to increased false negative rates (FNR) and decreased accuracy in classifying harmful model outputs. Minor stylistic changes to model outputs, such as altering the formatting or tone, can significantly impact a judge's classification, while direct adversarial modifications to the generated text can fool judges into misclassifying even 100% of harmful generations as safe. This vulnerability impacts the reliability of LLM safety evaluations used in offline benchmarking, automated red-teaming, and online guardrails.
A vulnerability in safeguarded text-to-image models allows bypassing of safety filters and alignment methods through the use of adversarial prompts generated by a fine-tuned large language model (LLM). The attack, termed PromptTune, effectively rewrites unsafe prompts into semantically similar adversarial prompts that evade safety mechanisms, resulting in the generation of harmful images. The attack does not require repeated queries to the target text-to-image model.
A vulnerability in text-to-image (T2I) models allows bypassing safety filters through the use of metaphor-based adversarial prompts. These prompts, crafted using LLMs, indirectly convey sensitive content, exploiting the model's ability to infer meaning from figurative language while circumventing explicit keyword filters and model editing strategies.
Large Language Models (LLMs) with structured output APIs (e.g., using JSON Schema) are vulnerable to Constrained Decoding Attacks (CDAs). CDAs exploit the control plane of the LLM's decoding process by embedding malicious intent within the schema-level grammar rules, bypassing safety mechanisms that primarily focus on input prompts. The attack manipulates the allowed output space, forcing the LLM to generate harmful content despite a benign input prompt. One instance of a CDA is the Chain Enum Attack, which leverages JSON Schema's enum feature to inject malicious options into the allowed output, achieving high success rates.
enum
© 2025 Promptfoo. All rights reserved.