Vulnerabilities in prompt handling and processing
End-to-end Large Audio-Language Models (LALMs) are vulnerable to AudioJailbreak, a novel attack that appends adversarial audio perturbations ("jailbreak audios") to user prompts. These perturbations, even when applied asynchronously and without alignment to the user's speech, can manipulate the LALM's response to generate adversary-desired outputs that bypass safety mechanisms. The attack achieves universality by employing a single perturbation effective across different prompts and robustness to over-the-air transmission by incorporating reverberation effects during perturbation generation. Even with stealth strategies employed to mask malicious intent, the attack remains highly effective.
Chain-of-thought (CoT) reasoning, while intended to improve safety, can paradoxically increase the harmfulness of successful jailbreak attacks by enabling the generation of highly detailed and actionable instructions. Existing jailbreaking methods, when applied to LLMs employing CoT, can elicit more precise and dangerous outputs than those from LLMs without CoT.
GhostPrompt demonstrates a vulnerability in multimodal safety filters used with text-to-image generative models. The vulnerability allows attackers to bypass these filters by using a dynamic prompt optimization framework that iteratively generates adversarial prompts designed to evade both text-based and image-based safety checks while preserving the original, harmful intent of the prompt. This bypass is achieved through a combination of semantically aligned prompt rewriting and the injection of benign visual cues to confuse image-level filters.
Large Language Models (LLMs) are vulnerable to jailbreak attacks that exploit the model's inherent persuasive nature. A novel attack framework, CL-GSO, decomposes jailbreak strategies into four components (Role, Content Support, Context, Communication Skills), creating a significantly expanded strategy space compared to prior methods. This expanded space allows for the generation of prompts that bypass safety protocols with a success rate exceeding 90% on models previously considered resistant, such as Claude-3.5. The vulnerability lies in the susceptibility of the LLM's reasoning and response generation mechanisms to strategically crafted prompts leveraging these four components.
Large Language Models (LLMs) used for evaluating text quality (LLM-as-a-Judge architectures) are vulnerable to prompt-injection attacks. Maliciously crafted suffixes appended to input text can manipulate the LLM's judgment, causing it to incorrectly favor a predetermined response even if another response is objectively superior. Two attack vectors are identified: Comparative Undermining Attack (CUA), directly targeting the final decision, and Justification Manipulation Attack (JMA), altering the model's generated reasoning.
A vulnerability exists in Large Language Models (LLMs) that allows attackers to manipulate the model's output by modifying token log probabilities. Attackers can use a lightweight plug-in model (BiasNet) to subtly alter the probabilities, steering the LLM toward generating harmful content even when safety mechanisms are in place. This attack requires only access to the top-k token log probabilities returned by the LLM's API, without needing model weights or internal access.
Description: Large Language Models (LLMs) employing safety mechanisms based on token-level distribution analysis are vulnerable to a jailbreak attack exploiting distributional discrepancies between alignment data and formally expressed logical statements. The vulnerability allows malicious actors to bypass safety restrictions by translating harmful natural language prompts into equivalent first-order logic expressions. The LLM, trained primarily on natural language, fails to recognize the harmful intent encoded in the logically expressed input which falls outside its expected token distribution.
Multilingual prompt injection vulnerability in four closed-source Large Language Models (LLMs): GPT-4o, DeepSeek-R1, Gemini-1.5-Pro, and Qwen-Max. Attackers can bypass safety restrictions and elicit harmful or disallowed content by crafting prompts in English or Chinese, leveraging specific structural techniques (e.g., "Two Sides" prompting) that exploit inconsistencies in the models' safety alignment across languages and prompt formats.
Large Language Models (LLMs) are vulnerable to a novel jailbreak attack, termed ICE (Intent Concealment and Diversion), which leverages hierarchical prompt decomposition and semantic expansion to bypass safety filters. ICE achieves high attack success rates with single queries, exploiting the models' limitations in multi-step reasoning.
A steganographic jailbreak attack, termed StegoAttack, allows bypassing safety mechanisms in Large Language Models (LLMs) by embedding malicious queries within benign-appearing text. The attack hides the malicious query in the first word of each sentence of a seemingly innocuous paragraph, leveraging the LLM's autoregressive generation to process and respond to the hidden query, even when employing encryption in the response.
© 2025 Promptfoo. All rights reserved.