Attacks leveraging model implementation details
Large Language Models (LLMs) employing internal security mechanisms based on linearly separable embeddings in intermediate layers are vulnerable to a generative adversarial attack. The CAVGAN framework exploits this vulnerability by generating adversarial perturbations that cause malicious inputs to be misclassified as benign, allowing the attacker to bypass the LLM's safety filters and elicit harmful outputs.
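A minimal sketch of the attack surface this describes, assuming a frozen linear safety probe over intermediate-layer hidden states and a small perturbation generator; the layer width, generator architecture, and loss weights are illustrative assumptions, and the GAN discriminator used by CAVGAN is omitted for brevity.

```python
# Sketch only: perturb intermediate-layer embeddings so a linear safety probe
# labels a malicious input as benign. Not the actual CAVGAN implementation.
import torch
import torch.nn as nn

HIDDEN = 4096  # illustrative width of an intermediate layer

# Frozen stand-in for the model's internal safety probe (0 = benign, 1 = malicious)
safety_probe = nn.Linear(HIDDEN, 2)
for p in safety_probe.parameters():
    p.requires_grad_(False)

# Generator maps a malicious-prompt embedding to an adversarial perturbation
generator = nn.Sequential(nn.Linear(HIDDEN, 512), nn.ReLU(), nn.Linear(512, HIDDEN))
opt = torch.optim.Adam(generator.parameters(), lr=1e-4)

malicious_emb = torch.randn(8, HIDDEN)           # placeholder embeddings
benign_label = torch.zeros(8, dtype=torch.long)  # target class: "benign"

for step in range(200):
    delta = generator(malicious_emb)
    logits = safety_probe(malicious_emb + delta)
    # Push the perturbed embedding across the linear decision boundary while
    # keeping the perturbation small so downstream generation stays coherent.
    loss = nn.functional.cross_entropy(logits, benign_label) + 0.1 * delta.norm(dim=-1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```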
A vulnerability in Large Language Models (LLMs) allows adversarial prompt distillation from an LLM to a smaller language model (SLM), enabling efficient and stealthy jailbreak attacks. The attack combines knowledge distillation, reinforcement learning, and dynamic temperature control to transfer the LLM's ability to bypass safety mechanisms to a smaller, more easily deployable SLM, yielding attacks with lower computational cost and a potentially high success rate.
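A rough sketch of the temperature-scaled distillation step such an attack could build on; the teacher/student logits, vocabulary size, and linear temperature schedule are placeholder assumptions, and the reinforcement-learning reward loop is not shown.

```python
# Standard knowledge-distillation loss with a dynamic temperature schedule.
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, temperature):
    """KL divergence between temperature-softened distributions (standard KD)."""
    t = temperature
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)

def temperature_schedule(step, total_steps, t_max=4.0, t_min=1.0):
    """Dynamic temperature control: start soft (high t), sharpen over training."""
    frac = step / max(total_steps - 1, 1)
    return t_max + (t_min - t_max) * frac

teacher_logits = torch.randn(8, 32000)                       # large model's logits (placeholder)
student_logits = torch.randn(8, 32000, requires_grad=True)   # small model's logits (placeholder)
loss = distill_loss(student_logits, teacher_logits, temperature_schedule(step=0, total_steps=1000))
loss.backward()
```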
Large language models (LLMs) protected by multi-stage safeguard pipelines (input and output classifiers) are vulnerable to staged adversarial attacks (STACK). STACK exploits weaknesses in individual components sequentially, combining jailbreaks for each classifier with a jailbreak for the underlying LLM to bypass the entire pipeline. The staged attack achieves high attack success rates (ASR), even on datasets of particularly harmful queries.
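An illustrative sketch of how a staged bypass against a guarded pipeline could be structured; `input_guard`, `model`, `output_guard`, and the jailbreak fragments are hypothetical placeholders rather than any real deployment or the published STACK code.

```python
# Sketch: chain per-component jailbreak fragments so every stage of the
# safeguard pipeline is defeated in sequence.
from typing import Callable, Optional

def staged_attack(prompt: str,
                  input_guard: Callable[[str], bool],   # True = flagged by input classifier
                  model: Callable[[str], str],
                  output_guard: Callable[[str], bool],  # True = flagged by output classifier
                  input_jailbreak: str,
                  model_jailbreak: str) -> Optional[str]:
    # Stage 1: wrap the query so the input classifier no longer flags it.
    staged_prompt = input_jailbreak + model_jailbreak + prompt
    if input_guard(staged_prompt):
        return None  # this fragment fails at the first stage; try another candidate
    # Stage 2: the embedded model jailbreak must still elicit the harmful completion.
    response = model(staged_prompt)
    # Stage 3: the completion must also slip past the output classifier
    # (e.g. by instructing the model to obfuscate trigger keywords).
    if output_guard(response):
        return None
    return response
```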
A white-box vulnerability allows attackers with full model access to bypass LLM safety alignments by identifying and pruning parameters responsible for rejecting harmful prompts. The attack leverages a novel "twin prompt" technique to differentiate safety-related parameters from those essential for model utility, performing fine-grained pruning with minimal impact on overall model functionality.
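A conceptual sketch of twin-prompt pruning under the assumed white-box access; the single linear layer, the gradient-magnitude importance score, and the 99th-percentile threshold are illustrative choices, not necessarily those used in the original technique.

```python
# Sketch: compare parameter importance on a refused prompt vs. its near-identical
# answered "twin", then prune weights that matter only for refusal.
import torch
import torch.nn as nn

model = nn.Linear(256, 256)  # stand-in for one weight matrix inside the LLM

def grad_importance(inputs: torch.Tensor) -> torch.Tensor:
    """Gradient-magnitude importance of each weight for the given inputs."""
    model.zero_grad()
    model(inputs).sum().backward()
    return model.weight.grad.detach().abs()

harmful_twin = torch.randn(4, 256)  # stand-in activations for a refused (harmful) prompt
benign_twin = torch.randn(4, 256)   # stand-in activations for the answered twin prompt

# Weights far more important for the harmful twin than the benign one are treated
# as refusal-specific and pruned; utility-relevant weights are left untouched.
safety_score = grad_importance(harmful_twin) - grad_importance(benign_twin)
threshold = safety_score.flatten().kthvalue(int(0.99 * safety_score.numel())).values
with torch.no_grad():
    model.weight[safety_score > threshold] = 0.0
```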
A vulnerability in SpeechGPT allows bypassing safety filters through adversarial audio prompts crafted by a white-box token-level attack. The attacker leverages knowledge of SpeechGPT's internal speech tokenization process to generate adversarial token sequences, which are then synthesized into audio. These audio prompts elicit restricted or harmful outputs the model would normally suppress. The attack's effectiveness relies on the model's discrete audio token representation; beyond knowledge of the tokenizer, it does not require access to model parameters or gradients.
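A rough, gradient-free search sketch over discrete speech units; `score()` and `synthesize()` are hypothetical placeholders for response scoring and unit-to-waveform synthesis, and the vocabulary size and mutation strategy are assumptions rather than the published attack.

```python
# Sketch: mutate a discrete speech-token sequence, keep mutations that better
# elicit the restricted output, then synthesize the final sequence into audio.
import random

VOCAB_SIZE = 1000   # size of the discrete speech-unit vocabulary (assumed)
SEQ_LEN = 64

def score(tokens):
    """Placeholder: how strongly the synthesized prompt elicits the restricted output."""
    return random.random()

def synthesize(tokens):
    """Placeholder for the unit-to-waveform synthesis step."""
    return bytes()

tokens = [random.randrange(VOCAB_SIZE) for _ in range(SEQ_LEN)]
best = score(tokens)
for _ in range(500):
    pos = random.randrange(SEQ_LEN)
    candidate = tokens.copy()
    candidate[pos] = random.randrange(VOCAB_SIZE)  # mutate one discrete speech unit
    s = score(candidate)
    if s > best:                                   # keep mutations that improve elicitation
        tokens, best = candidate, s

adversarial_audio = synthesize(tokens)  # the audio prompt actually sent to the model
```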
A vulnerability in several open-source Large Language Models (LLMs) allows attackers using exponentiated gradient descent to craft adversarial prompts that cause the models to generate harmful or unintended outputs, effectively "jailbreaking" the safety alignment mechanisms. The attack optimizes a continuously relaxed one-hot encoding of the input tokens, which intrinsically satisfies the simplex constraints and avoids the projection techniques required by previous methods.
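A compact sketch of the exponentiated-gradient update over a relaxed one-hot encoding; `loss_fn` is a placeholder for the model's loss on the targeted continuation, and the vocabulary size, suffix length, and learning rate are arbitrary. The multiplicative update followed by renormalization is what keeps each row on the probability simplex without an explicit projection step.

```python
# Sketch: exponentiated gradient descent on a relaxed one-hot token encoding.
import torch

VOCAB, ADV_LEN, LR = 32000, 20, 0.5
x = torch.full((ADV_LEN, VOCAB), 1.0 / VOCAB)  # relaxed one-hot rows, start uniform

def loss_fn(relaxed_tokens: torch.Tensor) -> torch.Tensor:
    """Placeholder for the model's loss on the targeted harmful continuation."""
    return (relaxed_tokens ** 2).sum()

for _ in range(100):
    x.requires_grad_(True)
    loss = loss_fn(x)
    (grad,) = torch.autograd.grad(loss, x)
    with torch.no_grad():
        # Exponentiated gradient step: multiply by exp(-lr * grad), then renormalize,
        # so every row stays nonnegative and sums to 1 by construction.
        x = x * torch.exp(-LR * grad)
        x = x / x.sum(dim=-1, keepdim=True)

adv_token_ids = x.argmax(dim=-1)  # discretize the relaxed encoding into actual tokens
```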
Large Language Models (LLMs) are vulnerable to a novel privacy jailbreak attack, dubbed PIG (Privacy Jailbreak Attack on LLMs via Gradient-based Iterative In-Context Optimization). PIG leverages in-context learning and gradient-based iterative optimization to extract Personally Identifiable Information (PII) from LLMs, bypassing built-in safety mechanisms. The attack iteratively refines a crafted prompt based on gradient information, focusing on tokens related to PII entities, thereby increasing the likelihood of successful PII extraction.
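A compressed sketch of one gradient-guided substitution round restricted to PII-related positions; `embedding_matrix`, `loss_fn`, and the position list are placeholders rather than PIG's actual components.

```python
# Sketch: rank replacement tokens at PII-related positions by the gradient of the
# extraction loss with respect to a one-hot prompt encoding, then keep the best swap.
import torch

VOCAB, EMB, PROMPT_LEN = 32000, 512, 48
embedding_matrix = torch.randn(VOCAB, EMB)
prompt_ids = torch.randint(0, VOCAB, (PROMPT_LEN,))
pii_positions = [5, 6, 17]  # tokens referring to the targeted PII entity (assumed known)

def loss_fn(prompt_embeds: torch.Tensor) -> torch.Tensor:
    """Placeholder: negative log-likelihood of the model emitting the target PII."""
    return prompt_embeds.pow(2).mean()

one_hot = torch.nn.functional.one_hot(prompt_ids, VOCAB).float().requires_grad_(True)
loss = loss_fn(one_hot @ embedding_matrix)
loss.backward()

# The negative gradient is a first-order estimate of how much each token swap would
# lower the loss; the full attack repeats this with fresh in-context examples until
# the PII is extracted or a query budget is exhausted.
with torch.no_grad():
    for pos in pii_positions:
        candidate_scores = -one_hot.grad[pos]      # larger = more promising replacement
        prompt_ids[pos] = candidate_scores.argmax()
```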
Large Language Models (LLMs) employing alignment-based defenses against prompt injection and jailbreak attacks exhibit vulnerability to an informed white-box attack. This attack, termed Checkpoint-GCG, leverages intermediate model checkpoints from the alignment training process to initialize the Greedy Coordinate Gradient (GCG) attack. By using each checkpoint as a stepping stone, Checkpoint-GCG finds adversarial suffixes that bypass these defenses, achieving significantly higher attack success rates than standard GCG with naive initializations. This is particularly impactful because Checkpoint-GCG can discover universal adversarial suffixes effective across multiple inputs.
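A schematic of the checkpoint-stepping initialization; `load_checkpoint` and `run_gcg` are hypothetical helpers standing in for model loading and a standard GCG run, not a published API.

```python
# Sketch: warm-start GCG on each successive alignment checkpoint with the suffix
# found against the previous (weaker) one.
from typing import Callable, List

def checkpoint_gcg(checkpoint_paths: List[str],
                   load_checkpoint: Callable[[str], object],
                   run_gcg: Callable[[object, str], str],
                   initial_suffix: str = "! ! ! ! ! ! ! ! ! !") -> str:
    suffix = initial_suffix
    # Checkpoints are ordered from early (barely aligned) to final (fully aligned),
    # so each optimized suffix is a good initialization for the next, harder target.
    for path in checkpoint_paths:
        model = load_checkpoint(path)
        suffix = run_gcg(model, suffix)  # warm-start GCG from the previous solution
    return suffix  # adversarial suffix effective against the final aligned model
```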
The LARGO attack exploits a vulnerability in Large Language Models (LLMs) allowing attackers to bypass safety mechanisms through the generation of "stealthy" adversarial prompts. The attack leverages gradient optimization in the LLM's continuous latent space to craft seemingly innocuous natural language suffixes which, when appended to harmful prompts, elicit unsafe responses. The vulnerability stems from the LLM's inability to reliably distinguish between benign and maliciously crafted latent representations that are then decoded into natural language.
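A simplified sketch of latent-space suffix optimization followed by decoding back into text; `forward_from_embeddings` and `decode_to_text` are placeholders for the target LLM's embedding-level forward pass and the re-wording step that produces the fluent suffix.

```python
# Sketch: optimize a continuous suffix in embedding space with gradients, then
# decode it into an innocuous-looking natural-language suffix.
import torch

EMB, SUFFIX_LEN = 4096, 16
latent_suffix = torch.randn(SUFFIX_LEN, EMB, requires_grad=True)
opt = torch.optim.Adam([latent_suffix], lr=1e-2)

def forward_from_embeddings(suffix_embeds: torch.Tensor) -> torch.Tensor:
    """Placeholder: loss of the unsafe target continuation given prompt + suffix."""
    return suffix_embeds.pow(2).mean()

for _ in range(300):
    loss = forward_from_embeddings(latent_suffix)  # differentiable, unlike discrete token search
    opt.zero_grad()
    loss.backward()
    opt.step()

def decode_to_text(suffix_embeds: torch.Tensor) -> str:
    """Placeholder: re-word the optimized latent into fluent natural language,
    which is what makes the final suffix look innocuous to humans and filters."""
    return "<natural-language suffix>"

stealthy_suffix = decode_to_text(latent_suffix.detach())
```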
Large Language Models (LLMs) employing safety mechanisms based on supervised fine-tuning and preference alignment exhibit a vulnerability to "steering" attacks. Maliciously crafted prompts or input manipulations can exploit representation vectors within the model to either bypass censorship ("refusal-compliance vector") or suppress the model's reasoning process ("thought suppression vector"), resulting in the generation of unintended or harmful outputs. This vulnerability is demonstrated across several instruction-tuned and reasoning LLMs from various providers.
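A minimal activation-steering sketch, assuming the refusal-compliance direction is estimated as a difference of mean hidden states and subtracted via a forward hook; the hook layer, scaling factor, and activation sources are illustrative assumptions.

```python
# Sketch: estimate a "refusal-compliance" direction and remove it at inference time;
# an analogous vector can suppress the model's reasoning trace instead.
import torch
import torch.nn as nn

HIDDEN = 4096
refusal_acts = torch.randn(100, HIDDEN)     # hidden states collected on refused prompts (placeholder)
compliance_acts = torch.randn(100, HIDDEN)  # hidden states collected on answered prompts (placeholder)

# Direction pointing from compliance toward refusal in representation space.
refusal_vector = refusal_acts.mean(0) - compliance_acts.mean(0)
refusal_vector = refusal_vector / refusal_vector.norm()

layer = nn.Linear(HIDDEN, HIDDEN)  # stand-in for one transformer block's output

def steering_hook(module, inputs, output, alpha=8.0):
    # Subtracting the refusal component nudges generation toward compliance.
    return output - alpha * (output @ refusal_vector).unsqueeze(-1) * refusal_vector

layer.register_forward_hook(steering_hook)
steered = layer(torch.randn(1, HIDDEN))
```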