Security concerns affecting AI safety and alignment
Large Language Models (LLMs) exhibit Defense Threshold Decay (DTD): generating substantial benign content shifts the model's attention from the input prompt to prior outputs, increasing susceptibility to jailbreak attacks. The "Sugar-Coated Poison" (SCP) attack exploits this by first generating benign content, then transitioning to malicious output.
A vulnerability in Large Language Models (LLMs) allows attackers to bypass safety mechanisms and elicit detailed harmful responses by strategically manipulating input prompts. The vulnerability exploits the LLM's sensitivity to "scenario shifts"—contextual changes in the input that influence the model's output, even when the core malicious request remains the same. A genetic algorithm can optimize these scenario shifts, increasing the likelihood of obtaining detailed harmful responses while maintaining a seemingly benign facade.
Large Language Models (LLMs) employing safety mechanisms are vulnerable to a graph-based attack that leverages semantic transformations of malicious prompts to bypass safety filters. The attack, termed GraphAttack, uses Abstract Meaning Representation (AMR), Resource Description Framework (RDF), and JSON knowledge graphs to represent malicious intent, systematically applying transformations to evade surface-level pattern recognition used by existing safety mechanisms. A particularly effective exploitation vector involves prompting the LLM to generate code based on the transformed semantic representation, bypassing intent-based safety filters.
Large Language Models (LLMs) are vulnerable to a jailbreaking attack leveraging humorous prompts. Embedding an unsafe request within a humorous context, using a fixed template, bypasses built-in safety mechanisms and elicits unsafe responses. The attack's success relies on a balance; too little or too much humor reduces effectiveness.
Large Language Model (LLM) guardrail systems, including those relying on AI-driven text classification models (e.g., fine-tuned BERT models), are vulnerable to evasion via character injection and adversarial machine learning (AML) techniques. Attackers can bypass detection by injecting Unicode characters (e.g., zero-width characters, homoglyphs) or using AML to subtly perturb prompts, maintaining semantic meaning while evading classification. This allows malicious prompts and jailbreaks to reach the underlying LLM.
Multilingual and multi-accent audio inputs, combined with acoustic adversarial perturbations (reverberation, echo, whisper effects), can bypass safety mechanisms in Large Audio Language Models (LALMs), causing them to generate unsafe or harmful outputs. The vulnerability is amplified by the interaction between acoustic and linguistic variations, particularly in languages with less training data.
A vulnerability in multi-agent Large Language Model (LLM) systems allows for a permutation-invariant adversarial prompt attack. By strategically partitioning adversarial prompts and routing them through a network topology, an attacker can bypass distributed safety mechanisms, even those with token bandwidth limitations and asynchronous message delivery. The attack optimizes prompt propagation as a maximum-flow minimum-cost problem, maximizing success while minimizing detection.
Multimodal Large Language Models (MLLMs) are vulnerable to a jailbreaking attack, dubbed PiCo, that leverages token-level typographic attacks on images embedded within code-style instructions. The attack bypasses multi-tiered defense mechanisms, including input filtering and runtime monitoring, by exploiting weaknesses in the visual modality's integration with programming contexts. Harmful intent is concealed within visually benign image fragments and code instructions, circumventing safety protocols.
LLM agents utilizing external tools are vulnerable to indirect prompt injection (IPI) attacks. Attackers can embed malicious instructions into the external data accessed by the agent, manipulating its behavior even when defenses against direct prompt injection are in place. Adaptive attacks, which modify the injected payload based on the specific defense mechanism, consistently bypass existing defenses with a success rate exceeding 50%.
Large Language Models (LLMs) are vulnerable to multi-turn adversarial attacks that exploit incremental policy erosion. The attacker uses a breadth-first search strategy to generate multiple prompts at each turn, leveraging partial compliance from previous responses to gradually escalate the conversation towards eliciting disallowed outputs. Minor concessions accumulate, ultimately leading to complete circumvention of safety measures.
© 2025 Promptfoo. All rights reserved.