Methods for bypassing model safety measures
A vulnerability, termed "Content Concretization," exists in Large Language Models (LLMs) wherein safety filters can be bypassed by iteratively refining a malicious request. The attack uses a less-constrained, lower-tier LLM to generate a preliminary draft (e.g., pseudocode or a non-executable prototype) of a malicious tool from an abstract prompt. This "concretized" draft is then passed to a more capable, higher-tier LLM. The higher-tier LLM, when prompted to refine or complete the existing draft, is significantly more likely to generate the full malicious, executable content than if it had received the initial abstract prompt directly. This exploits a weakness in safety alignment where models are more permissive in extending existing content compared to generating harmful content from scratch.
A vulnerability exists in multiple Large Language Models (LLMs) where an attacker can bypass safety alignment by exploiting the model's ethical reasoning capabilities. The attack, named TRIAL (Trolley-problem Reasoning for Interactive Attack Logic), frames a harmful request within a multi-turn ethical dilemma modeled on the trolley problem. The harmful action is presented as the "lesser of two evils" required to prevent a catastrophic outcome, compelling the model to engage in utilitarian justification. This creates a conflict between the model's deontological safety rules (e.g., "do not generate harmful content") and the consequentialist logic of the scenario. Through a series of iterative, context-aware queries, the attacker progressively reinforces the model's commitment to the harmful path, leading it to generate content it would normally refuse. Paradoxically, the attack is more effective against models with more advanced reasoning capabilities.
A vulnerability exists in multiple Large Language Models (LLMs) where safety alignment mechanisms can be bypassed by reframing harmful instructions as "learning-style" or academic questions. The technique, named Hiding Intention by Learning from LLMs (HILL), transforms direct harmful requests into exploratory questions using simple hypotheticality indicators (e.g., "for academic curiosity", "in the movie") and detail-oriented inquiries (e.g., "provide a step-by-step breakdown"). The attack exploits the models' inherent helpfulness and their training on academic and explanatory text, causing them to generate harmful content that they would otherwise refuse.
Large Language Models (LLMs) exhibit a significantly lower safety threshold when prompted in low-resource languages, such as Singlish, Malay, and Tamil, compared to high-resource languages like English. This vulnerability allows for the generation of toxic, biased, and hateful content through simple prompts. The models are susceptible to "toxicity jailbreaks" where providing a few toxic examples in-context (few-shot prompting) causes a substantial increase in the generation of harmful outputs, bypassing their safety alignments. The vulnerability is pronounced in tasks involving conversational response, question-answering, and content composition.
A vulnerability exists in multiple Large Language Models (LLMs) that allows for safety alignment bypass through a technique named Activation-Guided Local Editing (AGILE). The attack uses white-box access to a source model's internal states (activations and attention scores) to craft a transferable text-based prompt that elicits harmful content.
Large Language Models (LLMs) are vulnerable to automated adversarial attacks that systematically combine multiple jailbreaking "primitives" into complex prompt chains. A dynamic optimization engine can generate and test billions of unique combinations of techniques (e.g., low-resource language translation, payload splitting, role-playing) to bypass safety guardrails. This combinatorial approach differs from manual red-teaming by systematically exploring the attack surface, achieving near-universal success in eliciting harmful content. The vulnerability lies in the models' inability to maintain safety alignment when faced with a sequence of layered obfuscation and manipulation techniques.
Large Reasoning Models (LRMs) can be instructed via a single system prompt to act as autonomous adversarial agents. These agents engage in multi-turn persuasive dialogues to systematically bypass the safety mechanisms of target language models. The LRM autonomously plans and executes the attack by initiating a benign conversation and gradually escalating the harmfulness of its requests, thereby circumventing defenses that are not robust to sustained, context-aware persuasive attacks. This creates a vulnerability where more advanced LRMs can be weaponized to compromise the alignment of other models, a dynamic described as "alignment regression".
A vulnerability, known as Latent Fusion Jailbreak (LFJ), exists in certain Large Language Models that allows an attacker with white-box access to bypass safety alignments. The attack interpolates the internal hidden state representations of a harmful query and a thematically similar benign query. By using gradient-guided optimization to identify and modify influential layers and tokens, a fused hidden state is created that causes the model to generate prohibited content in response to the harmful query, bypassing refusal mechanisms. The attack does not require modification of the input prompt, making it stealthy at the input level.
Large language models that support a developer role in their API are vulnerable to a jailbreaking attack that leverages malicious developer messages. An attacker can craft a developer message that overrides the model's safety alignment by setting a permissive persona, providing explicit instructions to bypass refusals, and using few-shot examples of harmful query-response pairs. This technique, named D-Attack, is effective on its own. A more advanced variant, DH-CoT, enhances the attack by aligning the developer message's context (e.g., an educational setting) with a hijacked Chain-of-Thought (H-CoT) user prompt, significantly increasing its success rate against reasoning-optimized models that are otherwise resistant to simpler jailbreaks.
A universal prompt injection vulnerability, termed "Involuntary Jailbreak," affects multiple large language models. The attack uses a single prompt that instructs the model to learn a pattern from abstract string operators (X and Y). The model is then asked to generate its own examples of questions that should be refused (harmful questions) and provide detailed, non-refusal answers to them, in order to satisfy the learned operator logic. This reframes the generation of harmful content as a logical puzzle, causing the model to bypass its safety and alignment training. The vulnerability is untargeted, allowing it to elicit a wide spectrum of harmful content without the attacker specifying a malicious goal.