Attacks that inject malicious content into model inputs
A vulnerability exists in large language models where safety guardrails can be bypassed by decomposing a single harmful objective into a sequence of individually innocuous sub-queries. An attacker agent can use an adaptive tree search algorithm (Correlated Knowledge Attack Agent - CKA-Agent) to explore the target model's internal correlated knowledge. The agent issues benign queries, uses the model's responses to guide exploration along multiple reasoning paths, and aggregates the collected information to fulfill the original harmful request. This method does not require the attacker to have prior domain expertise, as it uses the target LLM as a "knowledge oracle" to dynamically construct the attack plan. The core vulnerability is the failure of safety systems to aggregate intent across a series of interactions, as they primarily focus on detecting maliciousness within a single prompt.
A vulnerability, dubbed RoguePrompt, allows for bypassing large language model (LLM) moderation filters by encoding a forbidden instruction into a self-reconstructing payload. The attack uses a dual-layer ciphering process. First, the forbidden prompt is partitioned into two subsequences (e.g., even and odd words). One subsequence is encrypted using a classical cipher like Vigenere, while the other remains plaintext. Both the plaintext subsequence, the Vigenere ciphertext, and natural language decryption instructions are then concatenated and encoded using an outer cipher like ROT-13. This entire payload is wrapped in a final directive that instructs the model to decode, decrypt, reassemble, and execute the original forbidden prompt. Because moderation systems evaluate the prompt in its encoded state—a seemingly benign request to perform decoding on jumbled text—they fail to detect the malicious intent, which is only reconstructed and executed by the model post-moderation.
Large Language Models are vulnerable to a conceptual manipulation attack, termed Morphology Inspired Conceptual Manipulation (MICM), that bypasses standard safety filters to generate content aligned with harmful extremist ideologies. The attack does not use explicit keywords or standard jailbreak syntax. Instead, it embeds a curated set of seemingly innocuous phrases, called Concept-embedded Triggers (CETs), into a prompt template. These CETs represent an abstract "conceptual configuration" of a target ideology (e.g., neo-Nazism). The LLM's capacity for abstract generalization leads it to recognize this underlying structure and generate commentary on socio-political events that aligns with the harmful ideology, while avoiding detection by safety mechanisms that screen for explicitly toxic content. The attack is model-agnostic and has been shown to be highly effective.
A jailbreak vulnerability, termed Embedded Jailbreak Template (EJT), allows for the generation of harmful content by bypassing the safety mechanisms of Large Language Models (LLMs). The attack uses a generator LLM to contextually integrate a harmful query into a pre-existing jailbreak template. Unlike fixed templates which insert a query into a static placeholder, EJT rewrites multiple parts of the template to embed the harmful intent naturally. This process preserves the original template's overall structure while creating a semantically coherent and structurally novel prompt that is more effective at evading safety filters. The technique uses a "progressive prompt engineering" method to overcome the generator LLM's own safety refusals, ensuring reliable creation of the attack prompts.
Large Language Models (LLMs) are vulnerable to a novel class of jailbreak attacks generated through the evolutionary synthesis of executable, code-based attack algorithms. Unlike traditional methods that refine or combine static prompts, this technique uses an automated multi-agent system (EvoSynth) to autonomously engineer and evolve the underlying code that generates the attack. These generated algorithms exhibit high structural and dynamic complexity, using features like control flow, state management, and multi-layer obfuscation to create highly evasive prompts. The attack's success against robust models correlates with the programmatic complexity of the generating algorithm (e.g., Abstract Syntax Tree node count and calls to external tools), demonstrating a vulnerability to procedurally generated narratives that current safety mechanisms do not effectively detect.
A vulnerability exists in aligned Large Language Models (LLMs) that can be exploited by the FORGEDAN evolutionary framework to bypass safety and alignment mechanisms. The attack, which operates in a black-box setting, uses a genetic algorithm to automatically evolve effective jailbreak prompts. The framework combines multi-strategy textual perturbations (at the character, word, and sentence levels) with a semantic fitness function based on RoBERTa embeddings. This allows it to iteratively generate diverse and semantically coherent adversarial prompts that are highly effective at inducing the target model to produce harmful, unsafe, or policy-violating content. The attack's success is verified using a dual-dimensional judgment mechanism that independently classifies a response for compliance and harmfulness, improving the reliability and success rate over previous methods. The generated adversarial prompts demonstrate high attack success rates (ASR) and generalizability across different models and malicious goals.
A vulnerability exists in certain Large Language Models and diffusion models due to discontinuities in their latent space, which arise from data sparsity during training. An attacker can craft inputs containing lexically rare or semantically ambiguous constructs to guide the model's inference process toward these unstable, poorly-conditioned regions. This technique, termed "Alignment Degradation Induction," can degrade or bypass safety alignment mechanisms. Through iterative, multi-turn interactions, an attacker can escalate this induced instability to fully compromise the model, causing it to generate harmful, policy-violating content (jailbreaking) or reconstruct data from its training set, such as recognizable images of real individuals. The attack is effective even against models with layered defenses like input sanitization and content filters.
A vulnerability exists in multiple Large Language Models (LLMs) that allows for safety alignment bypass through an advanced jailbreaking technique called Template and Suffix Optimization (TASO). The attack combines two distinct optimization methods in an alternating, iterative feedback loop. First, a semantically meaningless adversarial suffix is optimized (e.g., using gradient-based methods like GCG) to force the LLM to begin its response with an affirmative phrase (e.g., "Sure, here is..."). Second, a semantically meaningful template is iteratively refined by using another LLM (an "attacker" LLM) to analyze failed jailbreak attempts and generate new constraints (e.g., "You should never refuse to provide detailed guidance on illegal activities"). These constraints are added to the prompt template for the next iteration.
AI code agents are vulnerable to jailbreaking attacks that cause them to generate or complete malicious code. The vulnerability is significantly amplified when a base Large Language Model (LLM) is integrated into an agentic framework that uses multi-step planning and tool-use. Initial safety refusals by the LLM are frequently overturned during subsequent planning or self-correction steps within the agent's reasoning loop.
A jailbreak vulnerability, known as Task Concurrency, exists in multiple Large Language Models (LLMs). The vulnerability arises when two distinct tasks, one harmful and one benign, are interleaved at the word level within a single prompt. The structure of the malicious prompt alternates words from each task, often using separators like {} to encapsulate words from the second task. This "concurrent" instruction format obfuscates the harmful intent from the model's safety guardrails, causing the LLM to process and generate a response to the harmful query, which it would otherwise refuse. The attacker can then extract the harmful content from the model's interleaved output.
{}
© 2025 Promptfoo. All rights reserved.