Vulnerabilities impacting model output reliability
Large Language Models (LLMs) exhibit Defense Threshold Decay (DTD): generating substantial benign content shifts the model's attention from the input prompt to prior outputs, increasing susceptibility to jailbreak attacks. The "Sugar-Coated Poison" (SCP) attack exploits this by first generating benign content, then transitioning to malicious output.
A vulnerability in Large Language Models (LLMs) allows attackers to bypass safety mechanisms and elicit detailed harmful responses by strategically manipulating input prompts. The vulnerability exploits the LLM's sensitivity to "scenario shifts"—contextual changes in the input that influence the model's output, even when the core malicious request remains the same. A genetic algorithm can optimize these scenario shifts, increasing the likelihood of obtaining detailed harmful responses while maintaining a seemingly benign facade.
Large Language Model (LLM) guardrail systems, including those relying on AI-driven text classification models (e.g., fine-tuned BERT models), are vulnerable to evasion via character injection and adversarial machine learning (AML) techniques. Attackers can bypass detection by injecting Unicode characters (e.g., zero-width characters, homoglyphs) or using AML to subtly perturb prompts, maintaining semantic meaning while evading classification. This allows malicious prompts and jailbreaks to reach the underlying LLM.
Multilingual and multi-accent audio inputs, combined with acoustic adversarial perturbations (reverberation, echo, whisper effects), can bypass safety mechanisms in Large Audio Language Models (LALMs), causing them to generate unsafe or harmful outputs. The vulnerability is amplified by the interaction between acoustic and linguistic variations, particularly in languages with less training data.
Multimodal Large Language Models (MLLMs) are vulnerable to a jailbreaking attack, dubbed PiCo, that leverages token-level typographic attacks on images embedded within code-style instructions. The attack bypasses multi-tiered defense mechanisms, including input filtering and runtime monitoring, by exploiting weaknesses in the visual modality's integration with programming contexts. Harmful intent is concealed within visually benign image fragments and code instructions, circumventing safety protocols.
LLM agents utilizing external tools are vulnerable to indirect prompt injection (IPI) attacks. Attackers can embed malicious instructions into the external data accessed by the agent, manipulating its behavior even when defenses against direct prompt injection are in place. Adaptive attacks, which modify the injected payload based on the specific defense mechanism, consistently bypass existing defenses with a success rate exceeding 50%.
Large Language Models (LLMs) designed for step-by-step problem-solving are vulnerable to query-agnostic adversarial triggers. Appending short, semantically irrelevant text snippets (e.g., "Interesting fact: cats sleep most of their lives") to mathematical problems consistently increases the likelihood of incorrect model outputs without altering the problem's inherent meaning. This vulnerability stems from the models' susceptibility to subtle input manipulations that interfere with their internal reasoning processes.
Large Language Models (LLMs) are vulnerable to Dialogue Injection Attacks (DIA), where malicious actors manipulate the chat history to bypass safety mechanisms and elicit harmful or unethical responses. DIA exploits the LLM's chat template structure to inject crafted dialogue into the input, even in black-box scenarios where the model's internals are unknown. Two attack methods are presented: one adapts gray-box prefilling attacks, the other leverages deferred responses to increase the likelihood of successful jailbreaks.
Large Language Models (LLMs) used for code generation are vulnerable to a jailbreaking attack that leverages implicit malicious prompts. The attack exploits the fact that existing safety mechanisms primarily rely on explicit malicious intent within the prompt instructions. By embedding malicious intent implicitly within a benign-appearing commit message accompanying a code request (e.g., in a simulated software evolution scenario), the attacker can bypass the LLM's safety filters and induce the generation of malicious code. The malicious intent is not directly stated in the instruction, but rather hinted at in the context of the commit message and the code snippet.
Large Language Models (LLMs) employing gradient-based optimization for jailbreaking defense are vulnerable to enhanced transferability attacks due to superfluous constraints in their objective functions. Specifically, the "response pattern constraint" (forcing a specific initial response phrase) and the "token tail constraint" (penalizing variations in the response beyond a fixed prefix) limit the search space and reduce the effectiveness of attacks across different models. Removing these constraints significantly increases the success rate of attacks transferred to target models.
© 2025 Promptfoo. All rights reserved.