Integrity Vulnerabilities

Vulnerabilities impacting model output reliability

Related Vulnerabilities

253 entries

Stealthy Unlearning Degradation

6/30/2025

A vulnerability in fine-tuning-based large language model (LLM) unlearning allows malicious actors to craft manipulated forgetting requests. By subtly increasing the frequency of common benign tokens within the forgetting data, the attacker can cause the unlearned model to exhibit unintended unlearning behaviors when these benign tokens appear in normal user prompts, leading to a degradation of model utility for legitimate users. This occurs because existing unlearning methods fail to effectively distinguish between benign tokens and those truly related to the target knowledge being unlearned.

Keeping an Eye on LLM Unlearning: The Hidden Risk and Remedy

Affects: llama 3.1 (8b), mistral v0.3 (7b)
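
The description above turns on benign tokens appearing more often in the forgetting data than they normally would. As a rough illustration only, the sketch below flags tokens that are over-represented in a set of forgetting requests relative to a reference corpus. It is a naive screening heuristic and not the remedy proposed in the paper; the function name `inflated_tokens`, the whitespace tokenization, and the `ratio` and `min_count` thresholds are all assumptions made for this example.

```python
from collections import Counter


def inflated_tokens(forget_texts, reference_texts, ratio=3.0, min_count=5):
    """Flag tokens whose relative frequency in the forgetting data is at least
    `ratio` times their relative frequency in a reference corpus.

    Hypothetical helper: whitespace tokenization and both thresholds are
    illustrative assumptions, not values taken from the paper.
    """
    forget = Counter(tok for text in forget_texts for tok in text.lower().split())
    ref = Counter(tok for text in reference_texts for tok in text.lower().split())
    forget_total = sum(forget.values()) or 1
    ref_total = sum(ref.values()) or 1

    flagged = {}
    for tok, count in forget.items():
        # Only consider tokens that also occur in ordinary text ("benign" tokens)
        # and that appear often enough in the forgetting data to matter.
        if count < min_count or tok not in ref:
            continue
        lift = (count / forget_total) / (ref[tok] / ref_total)
        if lift >= ratio:
            flagged[tok] = round(lift, 2)
    return flagged
```

A forgetting request that triggers this check is not necessarily malicious; under these assumptions it only marks requests that may deserve manual review before unlearning is run.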

Asynchronous Audio Jailbreak

5/31/2025

End-to-end Large Audio-Language Models (LALMs) are vulnerable to AudioJailbreak, a novel attack that appends adversarial audio perturbations ("jailbreak audios") to user prompts. These perturbations, even when applied asynchronously and without alignment to the user's speech, can manipulate the LALM into generating adversary-desired outputs that bypass safety mechanisms. The attack achieves universality by using a single perturbation that is effective across different prompts, and it achieves robustness to over-the-air transmission by incorporating reverberation effects during perturbation generation. The attack remains highly effective even when stealth strategies are used to mask the malicious intent.

AudioJailbreak: Jailbreak Attacks against End-to-End Large Audio-Language Models

Affects: gpt-4o, funaudiollm, mini-omni, qwen2-audio, speechgpt, mini-omni2, qwen-audio, llasm, llama-omni, salmonn, blsp, ichigo

CoT Jailbreak Mitigation Failure

6/12/2025

Chain-of-thought (CoT) reasoning, while intended to improve safety, can paradoxically increase the harmfulness of successful jailbreak attacks by enabling the generation of highly detailed and actionable instructions. Existing jailbreaking methods, when applied to LLMs employing CoT, can elicit more precise and dangerous outputs than those from LLMs without CoT.

Does Chain-of-Thought Reasoning Really Reduce Harmfulness from Jailbreaking?

Affects: o1, qwq32b-preview, deepseek-r1, gpt-4o, qwen72b, qwen32b, qwen7b, deepseek-chat, glm, llama-3.1-8b-instruct, claude-3-5-sonnet, o3-mini

Dynamic Prompt Jailbreak

5/31/2025

GhostPrompt demonstrates a vulnerability in multimodal safety filters used with text-to-image generative models. The vulnerability allows attackers to bypass these filters by using a dynamic prompt optimization framework that iteratively generates adversarial prompts designed to evade both text-based and image-based safety checks while preserving the original, harmful intent of the prompt. This bypass is achieved through a combination of semantically aligned prompt rewriting and the injection of benign visual cues to confuse image-level filters.

GhostPrompt: Jailbreaking Text-to-image Generative Models based on Dynamic Optimization

Affects: gpt-3.5, gpt-4.1, dall·e 3, shieldlm-7b, internvl2-2b, deepseek-v3, qwen2.5-7b-instruct, flux.1-schnell

Expanded Strategy Jailbreak

5/31/2025

Large Language Models (LLMs) are vulnerable to jailbreak attacks that exploit the models' susceptibility to persuasion. A novel attack framework, CL-GSO, decomposes jailbreak strategies into four components (Role, Content Support, Context, Communication Skills), creating a significantly expanded strategy space compared to prior methods. This expanded space allows for the generation of prompts that bypass safety protocols with a success rate exceeding 90% on models previously considered resistant, such as Claude-3.5. The vulnerability lies in the susceptibility of the LLM's reasoning and response generation mechanisms to strategically crafted prompts leveraging these four components.

Breaking the Ceiling: Exploring the Potential of Jailbreak Attacks through Expanding Strategy Space

Affects: claude-3.5-sonnet, llama3-8b, qwen-2.5-7b, gpt-4o, gpt-3.5

Hidden Image Jailbreak

5/31/2025

Multimodal large language models (MLLMs) are vulnerable to implicit jailbreak attacks that leverage least significant bit (LSB) steganography to conceal malicious instructions within images. These instructions are coupled with seemingly benign image-related text prompts, causing the MLLM to execute the hidden malicious instructions. The attack bypasses existing safety mechanisms by exploiting cross-modal reasoning capabilities.

Implicit Jailbreak Attacks via Cross-Modal Information Concealment on Vision-Language Models

Affects: gpt-4o, gemini 1.5 pro, qwen2.5-vl-72b, gpt-4.5, gemini2.5-pro, internvl2-8b
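
For readers unfamiliar with the least significant bit (LSB) technique named above, the following is a minimal textbook sketch of how a payload can sit in the lowest bit of each pixel byte without visibly changing the image, and how it can be read back for inspection. It operates on a raw byte buffer rather than a real image file, and the function names `embed_lsb` and `extract_lsb` are assumptions for this example; it is not the paper's pipeline.

```python
def embed_lsb(pixels: bytes, message: bytes) -> bytes:
    """Write the bits of `message` into the lowest bit of successive carrier bytes."""
    bits = [(byte >> shift) & 1 for byte in message for shift in range(7, -1, -1)]
    if len(bits) > len(pixels):
        raise ValueError("message too long for carrier")
    out = bytearray(pixels)
    for i, bit in enumerate(bits):
        # Clear the lowest bit, then set it to the payload bit.
        out[i] = (out[i] & 0xFE) | bit
    return bytes(out)


def extract_lsb(pixels: bytes, n_bytes: int) -> bytes:
    """Rebuild `n_bytes` of payload from the lowest bit of successive carrier bytes."""
    out = bytearray()
    for byte_index in range(n_bytes):
        value = 0
        for bit_index in range(8):
            value = (value << 1) | (pixels[byte_index * 8 + bit_index] & 1)
        out.append(value)
    return bytes(out)
```

Because each carrier byte changes by at most 1, the modified image is visually indistinguishable from the original, which is why pixel-level inspection or re-encoding of uploaded images is sometimes considered as a countermeasure.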

LLM Judge Prompt Injection

5/31/2025

Large Language Models (LLMs) used for evaluating text quality (LLM-as-a-Judge architectures) are vulnerable to prompt-injection attacks. Maliciously crafted suffixes appended to input text can manipulate the LLM's judgment, causing it to incorrectly favor a predetermined response even if another response is objectively superior. Two attack vectors are identified: Comparative Undermining Attack (CUA), directly targeting the final decision, and Justification Manipulation Attack (JMA), altering the model's generated reasoning.

Investigating the Vulnerability of LLM-as-a-Judge Architectures to Prompt-Injection Attacks

Affects: qwen2.5-3b-instruct, falcon3-3b-instruct
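
Since the attacks above rely on a suffix appended to the evaluated text, one simple sanity check is to see whether the judge's verdict changes when the tail of each response is removed. The sketch below assumes a hypothetical `judge` callable that takes a question and two responses and returns "A" or "B"; that interface, the function name, and the `tail_chars` default are illustrative assumptions and do not come from the paper.

```python
from typing import Callable

# Hypothetical judge interface: (question, response_a, response_b) -> "A" or "B".
Judge = Callable[[str, str, str], str]


def verdict_flips_without_tail(
    judge: Judge, question: str, a: str, b: str, tail_chars: int = 80
) -> bool:
    """Return True if the verdict changes once the last `tail_chars` characters
    of each response are dropped, which may indicate an injected suffix."""
    full = judge(question, a, b)
    # Fall back to the untrimmed text when a response is shorter than the tail.
    trimmed_a = a[:-tail_chars] or a
    trimmed_b = b[:-tail_chars] or b
    trimmed = judge(question, trimmed_a, trimmed_b)
    return full != trimmed
```

A flip is only a signal, not proof of injection: a legitimate conclusion at the end of a response can also change the verdict, so a check like this would more plausibly feed a review queue than block submissions outright.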

LLM Multi-Agent IP Leakage

5/31/2025

Large Language Model (LLM)-based Multi-Agent Systems (MAS) are vulnerable to intellectual property (IP) leakage attacks. An attacker with black-box access (only interacting via the public API) can craft adversarial queries that propagate through the MAS, extracting sensitive information such as system prompts, task instructions, tool specifications, number of agents, and system topology.

IP Leakage Attacks Targeting LLM-Based Multi-Agent Systems

Affects: gpt-4o, gpt-4o-mini, llama-3.1-70b, qwen-2.5-72b, llama-3.1-8b
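
As one illustration of how leakage of a system prompt or task instruction might be measured on the output side, the sketch below computes the fraction of word n-grams from a protected string that reappear verbatim in a response. The function name `leaked_fraction` and the choice of 5-word n-grams are assumptions for this example, and the check only catches verbatim copying, not paraphrase.

```python
def leaked_fraction(secret: str, output: str, n: int = 5) -> float:
    """Fraction of the secret's word n-grams that appear verbatim in `output`."""

    def ngrams(text: str) -> set:
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    secret_grams = ngrams(secret)
    if not secret_grams:
        # Secret is shorter than n words, so there is nothing to compare.
        return 0.0
    return len(secret_grams & ngrams(output)) / len(secret_grams)
```

In a multi-agent setting, a check of this kind would need to run against each agent's protected text, since the description above notes that extracted information can include prompts and tool specifications from anywhere in the system.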

LLM Self-Introspection Jailbreak

5/31/2025

A vulnerability exists in Large Language Models (LLMs) that allows attackers to manipulate the model's output by modifying token log probabilities. Attackers can use a lightweight plug-in model (BiasNet) to subtly alter the probabilities, steering the LLM toward generating harmful content even when safety mechanisms are in place. This attack requires only access to the top-k token log probabilities returned by the LLM's API, without needing model weights or internal access.

JULI: Jailbreak Large Language Models by Self-Introspection

Affects: llama3-3b-instruct, llama3-8b-instruct, llama2-7b-chat, qwen2-1.5b-instruct, llama3-8b-cb, llama3-1b-instruct, mistral-7b, qwen2.5-1.5b-inst, llama3-8b-inst
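
To make the mechanism concrete, the toy sketch below shows how adding a per-token bias to the top-k log probabilities and renormalizing can change which token ranks first. The bias values here are hand-picked for illustration; in the attack described above they would come from the BiasNet plug-in, whose architecture and training are not reproduced here. The function name `rerank` is an assumption for this example.

```python
import math


def rerank(top_logprobs: dict[str, float], bias: dict[str, float]) -> dict[str, float]:
    """Add a per-token bias to top-k log probabilities and renormalize over those k tokens."""
    adjusted = {tok: lp + bias.get(tok, 0.0) for tok, lp in top_logprobs.items()}
    # Log of the normalizing constant, so the returned values are log probabilities again.
    log_z = math.log(sum(math.exp(value) for value in adjusted.values()))
    return {tok: value - log_z for tok, value in adjusted.items()}


top = {"Sorry": math.log(0.7), "Sure": math.log(0.2), "The": math.log(0.1)}
unbiased = rerank(top, {})
biased = rerank(top, {"Sure": 2.0})
print(max(unbiased, key=unbiased.get), max(biased, key=biased.get))
```

The point of the sketch is that a small additive shift is enough to reorder the candidates, which is consistent with the description's claim that access to top-k log probabilities alone is sufficient for the attack.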

Latent-Space Jailbreak Optimization

5/31/2025

The LARGO attack exploits a vulnerability in Large Language Models (LLMs) allowing attackers to bypass safety mechanisms through the generation of "stealthy" adversarial prompts. The attack leverages gradient optimization in the LLM's continuous latent space to craft seemingly innocuous natural language suffixes which, when appended to harmful prompts, elicit unsafe responses. The vulnerability stems from the LLM's inability to reliably distinguish between benign and maliciously crafted latent representations that are then decoded into natural language.

LARGO: Latent Adversarial Reflection through Gradient Optimization for Jailbreaking LLMs

Affects: llama-2-7b-chat-hf, llama-2-13b-chat-hf, phi-3-mini-4k-instruct, qwen-2.5-14b
