Vulnerabilities impacting model output reliability
AI code agents are vulnerable to jailbreaking attacks that cause them to generate or complete malicious code. The vulnerability is significantly amplified when a base Large Language Model (LLM) is integrated into an agentic framework that uses multi-step planning and tool-use. Initial safety refusals by the LLM are frequently overturned during subsequent planning or self-correction steps within the agent's reasoning loop.
A vulnerability exists in Large Language Models (LLMs) that support fine-tuning, allowing an attacker to bypass safety alignments using a small, benign dataset. The attack, "Attack via Overfitting," is a two-stage process. In Stage 1, the model is fine-tuned on a small set of benign questions (e.g., 10) paired with identical, repetitive refusal answers. This induces an overfitted state where the model learns to refuse all prompts, creating a sharp minimum in the loss landscape and making it highly sensitive to parameter changes. In Stage 2, the overfitted model is further fine-tuned on the same benign questions, but with their standard, helpful answers. This second fine-tuning step causes catastrophic forgetting of the general refusal behavior, leading to a collapse of safety alignment and causing the model to comply with harmful and malicious instructions. The attack is highly stealthy as the fine-tuning data appears benign to content moderation systems.
A vulnerability exists in certain safety-aligned Large Language Models (LLMs) due to an untargeted, gradient-based optimization attack method called Untargeted Jailbreak Attack (UJA). Unlike previous targeted attacks (e.g., GCG) that optimize a prompt to elicit a predefined string (e.g., "Sure, here is..."), UJA optimizes for a general objective: maximizing the unsafety probability of the model's response, as quantified by an external judge model.
unsafety
Large Language Models from multiple vendors are vulnerable to a "Camouflaged Jailbreak" attack. Malicious instructions are embedded within seemingly benign, technically complex prompts, often framed as system design or engineering problems. The models fail to recognize the harmful intent implied by the context and technical specifications, bypassing safety filters that rely on detecting explicit keywords. This leads to the generation of detailed, technically plausible instructions for creating dangerous devices or systems. The attack has a high success rate, with models demonstrating full obedience in over 94% of tested cases, treating the harmful requests as legitimate.
A vulnerability exists in tool-enabled Large Language Model (LLM) agents, termed Sequential Tool Attack Chaining (STAC), where a sequence of individually benign tool calls can be orchestrated to achieve a malicious outcome. An attacker can guide an agent through a multi-turn interaction, with each step appearing harmless in isolation. Safety mechanisms that evaluate individual prompts or actions fail to detect the threat because the malicious intent is distributed across the sequence and only becomes apparent from the cumulative effect of the entire tool chain, typically at the final execution step. This allows the bypass of safety guardrails to execute harmful actions in the agent's environment.
A vulnerability exists in multiple Large Language Models (LLMs) where an attacker can bypass safety alignments by exploiting the model's ethical reasoning capabilities. The attack, named TRIAL (Trolley-problem Reasoning for Interactive Attack Logic), frames a harmful request within a multi-turn ethical dilemma modeled on the trolley problem. The harmful action is presented as the "lesser of two evils" necessary to prevent a catastrophic outcome, compelling the model to engage in utilitarian justification. This creates a conflict between the model's deontological safety rules (e.g., "do not generate harmful content") and the consequentialist logic of the scenario. Through a series of iterative, context-aware queries, the attacker progressively reinforces the model's commitment to the harmful path, leading it to generate content it would normally refuse. The vulnerability is paradoxically more effective against models with more advanced reasoning abilities.
A vulnerability exists in multiple Large Language Models (LLMs) where safety alignment mechanisms can be bypassed by reframing harmful instructions as "learning-style" or academic questions. This technique, named Hiding Intention by Learning from LLMs (HILL), transforms direct, harmful requests into exploratory questions using simple hypotheticality indicators (e.g., "for academic curiosity", "in the movie") and detail-oriented inquiries (e.g., "provide a step-by-step breakdown"). The attack exploits the models' inherent helpfulness and their training on academic and explanatory text, causing them to generate harmful content that they would otherwise refuse.
Large Language Models (LLMs) exhibit a significantly lower safety threshold when prompted in low-resource languages, such as Singlish, Malay, and Tamil, compared to high-resource languages like English. This vulnerability allows for the generation of toxic, biased, and hateful content through simple prompts. The models are susceptible to "toxicity jailbreaks" where providing a few toxic examples in-context (few-shot prompting) causes a substantial increase in the generation of harmful outputs, bypassing their safety alignments. The vulnerability is pronounced in tasks involving conversational response, question-answering, and content composition.
A vulnerability exists in aligned Large Language Models (LLMs) where a harmful instruction can be obfuscated through a multi-step formalization process, bypassing safety mechanisms. The attack, named Prompt Jailbreaking via Semantic and Structural Formalization (PASS), uses a Reinforcement Learning (RL) agent to dynamically construct an adversarial prompt. The agent learns to apply a sequence of actions—such as symbolic abstraction, logical encoding, mathematical representation, metaphorical transformation, and strategic decomposition—to an initial harmful query. This iterative process transforms the query into a representation that is semantically equivalent in intent but structurally unrecognizable to the model's safety filters, resulting in the generation of prohibited content. The attack is adaptive and does not rely on fixed templates.
LLM-based search agents are vulnerable to manipulation via unreliable search results. An attacker can craft a website containing malicious content (e.g., misinformation, harmful instructions, or indirect prompt injections) that is indexed by search engines. When an agent retrieves and processes this page in response to a benign user query, it may uncritically accept the malicious content as factual and incorporate it into its final response. This allows the agent to be used as a vector for spreading harmful content, executing hidden commands, or promoting biased narratives, as the agents often fail to adequately verify the credibility of their retrieved sources. The vulnerability is demonstrated across five risk categories: Misinformation, Harmful Output, Bias Inducing, Advertisement Promotion, and Indirect Prompt Injection.
© 2025 Promptfoo. All rights reserved.