Vulnerabilities targeting the core model architecture and parameters
A vulnerability in SpeechGPT allows attackers to bypass safety filters with adversarial audio prompts crafted by a white-box token-level attack. The attacker uses knowledge of SpeechGPT's internal speech tokenization process to generate adversarial token sequences, which are then synthesized into audio. These audio prompts elicit restricted or harmful outputs the model would normally suppress. The attack relies on the model's discrete audio token representation: it requires knowledge of the tokenization process but not access to model parameters or gradients.
DNA language models, such as the Evo series, are vulnerable to jailbreak attacks that coerce the generation of DNA sequences with high homology to known human pathogens. The GeneBreaker framework demonstrates this by using a combination of carefully crafted prompts leveraging high-homology non-pathogenic sequences and a beam search guided by pathogenicity prediction models (e.g., PathoLM) and log-probability heuristics. This allows bypassing safety mechanisms and generating sequences exceeding 90% similarity to target pathogens.
A vulnerability in several open-source Large Language Models (LLMs) allows attackers to use exponentiated gradient descent to craft adversarial prompts that cause the models to generate harmful or unintended outputs, effectively "jailbreaking" their safety alignment. The attack optimizes a continuous relaxation of the one-hot encoding of the input tokens. The update intrinsically satisfies the constraints on that relaxation, so it avoids the projection techniques used in previous methods.
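The reason exponentiated gradient descent needs no projection can be seen on a toy problem. The sketch below is an illustration of the generic update rule only, not the attack described above: the linear loss, the four-entry "vocabulary", the learning rate, and the helper name `egd_step` are all assumptions made for this example. Each step multiplies the current vector by an exponential of the negative gradient and renormalizes, so the result stays non-negative and sums to one.

```python
import math

def egd_step(x, grad, lr):
    """One exponentiated gradient descent step on the probability simplex."""
    # Multiplicative update: entries with a larger gradient shrink faster.
    # Subtracting the smallest gradient keeps the exponents numerically
    # stable and does not change the result after normalization.
    g_min = min(grad)
    unnorm = [xi * math.exp(-lr * (gi - g_min)) for xi, gi in zip(x, grad)]
    total = sum(unnorm)
    # Normalization keeps the vector on the simplex, so no projection is needed.
    return [u / total for u in unnorm]

# Hypothetical toy setup: a relaxed one-hot vector over a 4-entry vocabulary
# and a linear loss L(x) = sum(c_i * x_i), whose gradient is simply c.
cost = [0.9, 0.2, 0.5, 0.7]
x = [0.25, 0.25, 0.25, 0.25]

for _ in range(200):
    x = egd_step(x, cost, lr=0.5)

# Index of the entry that ends up with the most mass.
best_token = max(range(len(x)), key=lambda i: x[i])
```

In this toy run the mass concentrates on the entry with the lowest cost, and `x` remains a valid probability vector at every step. A real attack would replace the linear loss with a gradient obtained from the target model.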
Large Language Models (LLMs) are vulnerable to a novel privacy jailbreak attack, dubbed PIG (Privacy Jailbreak Attack on LLMs via Gradient-based Iterative In-Context Optimization). PIG leverages in-context learning and gradient-based iterative optimization to extract Personally Identifiable Information (PII) from LLMs, bypassing built-in safety mechanisms. The attack iteratively refines a crafted prompt based on gradient information, focusing on tokens related to PII entities, thereby increasing the likelihood of successful PII extraction.
Large Language Models (LLMs) employing alignment-based defenses against prompt injection and jailbreak attacks exhibit vulnerability to an informed white-box attack. This attack, termed Checkpoint-GCG, leverages intermediate model checkpoints from the alignment training process to initialize the Greedy Coordinate Gradient (GCG) attack. By using each checkpoint as a stepping stone, Checkpoint-GCG finds adversarial suffixes that bypass defenses, achieving significantly higher attack success rates than standard GCG initialized with naive methods. This is particularly impactful because Checkpoint-GCG can discover universal adversarial suffixes that are effective across multiple inputs.
A vulnerability exists in Large Language Models (LLMs) that allows attackers to manipulate the model's output by modifying token log probabilities. Attackers can use a lightweight plug-in model (BiasNet) to subtly alter the probabilities, steering the LLM toward generating harmful content even when safety mechanisms are in place. This attack requires only access to the top-k token log probabilities returned by the LLM's API, without needing model weights or internal access.
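The effect of altering returned log probabilities can be illustrated with a small numeric example. This is a minimal sketch under stated assumptions: a fixed, hand-picked bias dictionary stands in for the output of a learned plug-in model, the tokens `"A"`, `"B"`, `"C"` and their probabilities are made up, and `rebias_top_k` is a hypothetical helper rather than part of any API.

```python
import math

def rebias_top_k(top_k_logprobs, bias):
    """Add a per-token bias to top-k log probabilities and renormalize."""
    shifted = {tok: lp + bias.get(tok, 0.0) for tok, lp in top_k_logprobs.items()}
    # Log-sum-exp normalization so the result is a proper distribution
    # over the top-k set.
    m = max(shifted.values())
    log_z = m + math.log(sum(math.exp(v - m) for v in shifted.values()))
    return {tok: v - log_z for tok, v in shifted.items()}

# Made-up top-k log probabilities, as an API might return them.
top_k = {"A": math.log(0.6), "B": math.log(0.3), "C": math.log(0.1)}

# Hypothetical bias favoring token "B".
bias = {"B": 1.5}

adjusted = rebias_top_k(top_k, bias)
original_choice = max(top_k, key=top_k.get)
adjusted_choice = max(adjusted, key=adjusted.get)
```

Here a modest additive bias is enough to change the most likely token from `"A"` to `"B"`, which shows why exposing top-k log probabilities gives a caller leverage over decoding even without access to model weights.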
The LARGO attack exploits a vulnerability in Large Language Models (LLMs) allowing attackers to bypass safety mechanisms through the generation of "stealthy" adversarial prompts. The attack leverages gradient optimization in the LLM's continuous latent space to craft seemingly innocuous natural language suffixes which, when appended to harmful prompts, elicit unsafe responses. The vulnerability stems from the LLM's inability to reliably distinguish between benign and maliciously crafted latent representations that are then decoded into natural language.
Large Language Models (LLMs) exhibit Defense Threshold Decay (DTD): generating substantial benign content shifts the model's attention from the input prompt to prior outputs, increasing susceptibility to jailbreak attacks. The "Sugar-Coated Poison" (SCP) attack exploits this by first generating benign content, then transitioning to malicious output.
Large Language Models (LLMs) employing safety mechanisms are vulnerable to a graph-based attack that leverages semantic transformations of malicious prompts to bypass safety filters. The attack, termed GraphAttack, uses Abstract Meaning Representation (AMR), Resource Description Framework (RDF), and JSON knowledge graphs to represent malicious intent, systematically applying transformations to evade surface-level pattern recognition used by existing safety mechanisms. A particularly effective exploitation vector involves prompting the LLM to generate code based on the transformed semantic representation, bypassing intent-based safety filters.
Large Language Models (LLMs) employing safety mechanisms based on supervised fine-tuning and preference alignment exhibit a vulnerability to "steering" attacks. Maliciously crafted prompts or input manipulations can exploit representation vectors within the model to either bypass censorship ("refusal-compliance vector") or suppress the model's reasoning process ("thought suppression vector"), resulting in the generation of unintended or harmful outputs. This vulnerability is demonstrated across several instruction-tuned and reasoning LLMs from various providers.
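The notion of a representation vector can be made concrete with a toy example. The sketch below assumes a difference-of-means construction, which is one common way to derive such a direction and is not confirmed here as the method behind the findings above. The three-dimensional "activations", the group names, and the helpers `steering_vector` and `steer` are all hypothetical.

```python
def mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def steering_vector(group_a, group_b):
    """Unit-length difference between the mean activations of two groups."""
    diff = [a - b for a, b in zip(mean(group_a), mean(group_b))]
    norm = dot(diff, diff) ** 0.5
    return [d / norm for d in diff]

def steer(hidden, direction, alpha):
    """Shift a hidden state along a direction by a signed strength alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Made-up activations for two groups of responses.
refusing = [[1.0, 0.2, 0.0], [0.8, 0.1, 0.1]]
complying = [[-0.9, 0.2, 0.0], [-1.1, 0.1, 0.1]]

direction = steering_vector(refusing, complying)
hidden = [0.5, 0.3, 0.2]

# Projection onto the direction before and after steering.
before = dot(hidden, direction)
after = dot(steer(hidden, direction, alpha=-2.0), direction)
```

Because the direction has unit length, shifting by `alpha` changes the projection by exactly `alpha`. In this toy case the hidden state moves from the "refusing" side of the direction to the "complying" side.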
© 2025 Promptfoo. All rights reserved.