Attacks leveraging model implementation details
A vulnerability in SpeechGPT allows bypassing safety filters through adversarial audio prompts crafted by a token-level attack. The attacker needs white-box knowledge only of SpeechGPT's internal speech tokenization process, using it to generate adversarial token sequences that are then synthesized into audio. These audio prompts elicit restricted or harmful outputs the model would normally suppress. The attack's effectiveness relies on the model's discrete audio token representation and requires no access to model parameters or gradients.
A vulnerability in several open-source Large Language Models (LLMs) allows attackers using exponentiated gradient descent to craft adversarial prompts that cause the models to generate harmful or unintended outputs, effectively "jailbreaking" the safety alignment mechanisms. The attack optimizes a continuous relaxation of the one-hot encoding of the input tokens; because exponentiated gradient descent keeps each iterate on the probability simplex, the constraints are satisfied intrinsically, avoiding the projection step required by previous methods.
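The simplex-preserving property can be illustrated with a minimal numeric sketch (a toy objective, not the actual LLM loss): the multiplicative update followed by renormalization keeps the relaxed one-hot vector a valid probability distribution at every step.

```python
import numpy as np

def egd_step(x, grad, lr):
    """One exponentiated gradient descent step.

    The multiplicative update plus renormalization keeps x on the
    probability simplex, so no explicit projection is needed."""
    x = x * np.exp(-lr * grad)
    return x / x.sum()

# Toy stand-in for the adversarial loss gradient: pull probability mass
# toward token index 2 (a hypothetical token that lowers the loss).
target = 2
def loss_grad(x):
    g = np.ones_like(x)   # penalize mass on every token ...
    g[target] = -1.0      # ... except the target token
    return g

x = np.full(5, 0.2)       # uniform relaxed one-hot over a 5-token vocab
for _ in range(50):
    x = egd_step(x, loss_grad(x), lr=0.5)

print(x.argmax())                  # mass concentrates on the target token
print(abs(x.sum() - 1.0) < 1e-9)   # simplex constraint held throughout
```

In a projection-based method such as PGD, the additive update would leave the simplex and an extra projection step would be needed; here the constraint holds by construction.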
Large Language Models (LLMs) are vulnerable to a novel privacy jailbreak attack, dubbed PIG (Privacy Jailbreak Attack on LLMs via Gradient-based Iterative In-Context Optimization). PIG leverages in-context learning and gradient-based iterative optimization to extract Personally Identifiable Information (PII) from LLMs, bypassing built-in safety mechanisms. The attack iteratively refines a crafted prompt based on gradient information, focusing on tokens related to PII entities, thereby increasing the likelihood of successful PII extraction.
Large Language Models (LLMs) employing alignment-based defenses against prompt injection and jailbreak attacks exhibit vulnerability to an informed white-box attack. This attack, termed Checkpoint-GCG, leverages intermediate model checkpoints from the alignment training process to initialize the Greedy Coordinate Gradient (GCG) attack. By using each checkpoint as a stepping stone, Checkpoint-GCG finds adversarial suffixes that bypass these defenses, achieving significantly higher attack success rates than standard GCG with naive initialization. The attack is particularly impactful because it can discover universal adversarial suffixes effective across multiple inputs.
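The stepping-stone idea can be sketched with a toy numerical analogue (not actual GCG, which does discrete token search): each "checkpoint" is modeled as a progressively shifted objective, and warm-starting the attack on each checkpoint from the previous checkpoint's solution makes the final, hardest objective much cheaper to solve than a cold start.

```python
import numpy as np

def optimize(loss_grad, x0, lr=0.1, tol=1e-3, max_steps=10_000):
    """Gradient descent until the gradient norm falls below tol."""
    x, steps = x0.copy(), 0
    while steps < max_steps:
        g = loss_grad(x)
        if np.linalg.norm(g) < tol:
            break
        x -= lr * g
        steps += 1
    return x, steps

# Hypothetical checkpoints: optima drift gradually as alignment training
# proceeds, with the last entry standing in for the fully aligned model.
optima = [np.array([1.0, 0.0]), np.array([3.0, 1.0]), np.array([5.0, 2.0])]
grads = [lambda x, o=o: x - o for o in optima]  # gradient of 0.5*||x - o||^2

# Cold start: attack the final checkpoint directly (naive initialization).
_, cold_steps = optimize(grads[-1], np.zeros(2))

# Stepping stones: reuse each checkpoint's solution as the next init.
x, warm_steps = np.zeros(2), 0
for g in grads:
    x, warm_steps = optimize(g, x)  # warm_steps ends as final-checkpoint cost

print(cold_steps, warm_steps)  # the final checkpoint is cheaper warm-started
```

The same logic underlies Checkpoint-GCG: a suffix that already defeats an earlier, weaker checkpoint is a much better initialization than a naive string of tokens.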
The LARGO attack exploits a vulnerability in Large Language Models (LLMs) allowing attackers to bypass safety mechanisms through the generation of "stealthy" adversarial prompts. The attack leverages gradient optimization in the LLM's continuous latent space to craft seemingly innocuous natural language suffixes which, when appended to harmful prompts, elicit unsafe responses. The vulnerability stems from the LLM's inability to reliably distinguish between benign and maliciously crafted latent representations that are then decoded into natural language.
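The optimize-then-decode pipeline can be sketched with a toy example (random stand-in embeddings and a distance-based stand-in for the LLM's adversarial loss): a suffix is optimized in continuous embedding space, then mapped back to discrete tokens by nearest-neighbor lookup against the vocabulary embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabulary embedding table: 20 tokens, 4-dim embeddings.
vocab = rng.normal(size=(20, 4))

# Stand-in adversarial objective: pull each suffix embedding toward a
# "jailbreak direction" (in the real attack this is the gradient of the
# LLM's loss on an unsafe target response, not a fixed vector).
direction = vocab[7] + 0.01 * rng.normal(size=4)

suffix = rng.normal(size=(3, 4))   # 3 continuous suffix positions
for _ in range(200):
    grad = suffix - direction      # gradient of 0.5 * ||e - direction||^2
    suffix -= 0.1 * grad

# Decode: nearest vocabulary embedding per suffix position yields the
# discrete natural-language suffix appended to the harmful prompt.
tokens = [int(np.argmin(((vocab - e) ** 2).sum(axis=1))) for e in suffix]
print(tokens)
```

The gap LARGO exploits is visible even in this sketch: the decoded tokens are ordinary vocabulary items, yet they were chosen purely for their latent-space position rather than their surface meaning.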
Large Language Models (LLMs) employing safety mechanisms based on supervised fine-tuning and preference alignment exhibit a vulnerability to "steering" attacks. Maliciously crafted prompts or input manipulations can exploit representation vectors within the model to either bypass censorship ("refusal-compliance vector") or suppress the model's reasoning process ("thought suppression vector"), resulting in the generation of unintended or harmful outputs. This vulnerability is demonstrated across several instruction-tuned and reasoning LLMs from various providers.
Large Language Model (LLM) guardrail systems, including those relying on AI-driven text classification models (e.g., fine-tuned BERT models), are vulnerable to evasion via character injection and adversarial machine learning (AML) techniques. Attackers can bypass detection by injecting Unicode characters (e.g., zero-width characters, homoglyphs) or using AML to subtly perturb prompts, maintaining semantic meaning while evading classification. This allows malicious prompts and jailbreaks to reach the underlying LLM.
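Both injection techniques can be demonstrated against a deliberately naive blocklist "guardrail" (real classifier-based guardrails are stronger than a substring check, but the same perturbations are what defeats them):

```python
# Toy guardrail: flags prompts containing a blocklisted phrase.
BLOCKLIST = ["ignore previous instructions"]

def naive_guardrail(prompt: str) -> bool:
    return any(bad in prompt.lower() for bad in BLOCKLIST)

malicious = "ignore previous instructions"

# Zero-width space (U+200B) between characters: invisible when rendered,
# but it breaks tokenization and string matching.
zwsp = "\u200b".join(malicious)

# Homoglyphs: Latin 'o'/'e' swapped for Cyrillic 'о' (U+043E) / 'е' (U+0435),
# visually identical but different code points.
homoglyph = malicious.replace("o", "\u043e").replace("e", "\u0435")

print(naive_guardrail(malicious))   # True: caught
print(naive_guardrail(zwsp))        # False: evades detection
print(naive_guardrail(homoglyph))   # False: evades detection
```

The perturbed prompts render identically (or nearly so) to a human and to the downstream LLM, which typically normalizes or tolerates the noise, while the guardrail's feature representation no longer matches.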
A vulnerability in several Large Language Models (LLMs) allows bypassing safety mechanisms through targeted noise injection. Explainable AI (XAI) techniques reveal specific layers within the LLM architecture most responsible for content filtering. Injecting noise into these layers or preceding layers circumvents safety restrictions, enabling the generation of harmful or previously prohibited outputs.
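The injection mechanism can be sketched on a toy feed-forward stack standing in for transformer blocks (the layer index here is arbitrary; in the attack it is the layer XAI attribution identifies as most responsible for filtering):

```python
import numpy as np

rng = np.random.default_rng(1)

# Four toy layers standing in for transformer blocks.
layers = [rng.normal(size=(8, 8)) * 0.5 for _ in range(4)]

def forward(x, noise_layer=None, sigma=0.0):
    """Forward pass, optionally adding Gaussian noise to one layer's
    activations, as the attack does to the filtering-relevant layer."""
    for i, w in enumerate(layers):
        x = np.tanh(x @ w)
        if i == noise_layer:
            x = x + rng.normal(scale=sigma, size=x.shape)
    return x

x = rng.normal(size=8)
clean = forward(x)
perturbed = forward(x, noise_layer=2, sigma=1.0)

# Noise injected at the targeted layer propagates through all later
# layers and changes the final activations.
print(np.allclose(clean, perturbed))  # False
```

In the real attack the changed downstream activations correspond to the safety-filtering computation being disrupted, so the refusal behavior encoded in those layers no longer fires.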
Large Language Models (LLMs) are more vulnerable to transferred gradient-based jailbreak attacks than prior work suggests, because existing attack objective functions contain superfluous constraints that limit transferability. Specifically, the "response pattern constraint" (forcing a specific initial response phrase) and the "token tail constraint" (penalizing variations in the response beyond a fixed prefix) restrict the search space and reduce the effectiveness of attacks across different models. Removing these constraints significantly increases the success rate of attacks transferred to target models.
A vulnerability exists in large language models (LLMs) where the model's internal representations (activations) in specific latent subspaces can be manipulated to trigger jailbreak responses. By calculating a perturbation vector based on the difference between the mean activations of "safe" and "jailbroken" states, an attacker can introduce a targeted perturbation to the model's activations, causing it to generate unsafe outputs even when presented with a safe prompt. This manipulates the model's state, causing it to transition from a safe to a jailbroken state. The success rate is context-dependent.
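The perturbation computation described above can be sketched with synthetic activations (the "jailbreak direction" and refusal score below are hypothetical stand-ins; in the real attack the activations are recorded from the model's residual stream):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 16

# Hypothetical latent direction separating safe and jailbroken states.
jailbreak_dir = rng.normal(size=d)

# Recorded activations: safe prompts vs. prompts in a jailbroken state.
safe_acts = rng.normal(size=(50, d))
jailbroken_acts = rng.normal(size=(50, d)) + jailbreak_dir

# The attack's perturbation vector: difference of mean activations.
perturbation = jailbroken_acts.mean(axis=0) - safe_acts.mean(axis=0)

def refusal_score(act):
    # Stand-in for the model's internal safety signal: projection onto
    # the (here, known) jailbreak direction. Higher = more jailbroken.
    return float(act @ jailbreak_dir)

# Adding the perturbation to a safe prompt's activation moves the model's
# state toward the jailbroken region of the latent subspace.
safe_prompt_act = rng.normal(size=d)
steered_act = safe_prompt_act + perturbation

print(refusal_score(safe_prompt_act) < refusal_score(steered_act))  # True
```

The same mean-difference construction underlies the "refusal-compliance" and "thought suppression" steering vectors described earlier: both are directions in activation space computed from contrastive sets of model states.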
© 2025 Promptfoo. All rights reserved.