Vulnerabilities targeting the core model architecture and parameters
A vulnerability exists where non-autoregressive Diffusion Language Models (DLLMs) can be leveraged to generate highly effective and transferable adversarial prompts against autoregressive LLMs. The technique, named INPAINTING, reframes the resource-intensive search for adversarial prompts into an efficient, amortized inference task. By providing a desired harmful or restricted response to a DLLM, the model can conditionally generate a corresponding low-perplexity prompt that elicits that response from a wide range of target models. The generated prompts often reframe the malicious request into a benign-appearing context (e.g., asking for an example of harmful content for educational purposes), making them difficult to detect via standard perplexity filters.
A jailbreak vulnerability, termed Embedded Jailbreak Template (EJT), allows for the generation of harmful content by bypassing the safety mechanisms of Large Language Models (LLMs). The attack uses a generator LLM to contextually integrate a harmful query into a pre-existing jailbreak template. Unlike fixed templates which insert a query into a static placeholder, EJT rewrites multiple parts of the template to embed the harmful intent naturally. This process preserves the original template's overall structure while creating a semantically coherent and structurally novel prompt that is more effective at evading safety filters. The technique uses a "progressive prompt engineering" method to overcome the generator LLM's own safety refusals, ensuring reliable creation of the attack prompts.
Large Language Models (LLMs) are vulnerable to a novel class of jailbreak attacks generated through the evolutionary synthesis of executable, code-based attack algorithms. Unlike traditional methods that refine or combine static prompts, this technique uses an automated multi-agent system (EvoSynth) to autonomously engineer and evolve the underlying code that generates the attack. These generated algorithms exhibit high structural and dynamic complexity, using features like control flow, state management, and multi-layer obfuscation to create highly evasive prompts. The attack's success against robust models correlates with the programmatic complexity of the generating algorithm (e.g., Abstract Syntax Tree node count and calls to external tools), demonstrating a vulnerability to procedurally generated narratives that current safety mechanisms do not effectively detect.
A vulnerability exists in aligned Large Language Models (LLMs) that can be exploited by the FORGEDAN evolutionary framework to bypass safety and alignment mechanisms. The attack, which operates in a black-box setting, uses a genetic algorithm to automatically evolve effective jailbreak prompts. The framework combines multi-strategy textual perturbations (at the character, word, and sentence levels) with a semantic fitness function based on RoBERTa embeddings. This allows it to iteratively generate diverse and semantically coherent adversarial prompts that are highly effective at inducing the target model to produce harmful, unsafe, or policy-violating content. The attack's success is verified using a dual-dimensional judgment mechanism that independently classifies a response for compliance and harmfulness, improving the reliability and success rate over previous methods. The generated adversarial prompts demonstrate high attack success rates (ASR) and generalizability across different models and malicious goals.
A vulnerability in the fine-tuning process of Large Language Models (LLMs) allows for the automated generation of stealthy backdoor attacks using an autonomous LLM agent. This method, termed AutoBackdoor, creates a pipeline to generate semantically coherent trigger phrases and corresponding poisoned instruction-response pairs. Unlike traditional backdoor attacks that rely on fixed, often anomalous triggers, this technique produces natural language triggers that are contextually relevant and difficult to detect. Fine-tuning a model on a small number of these agent-generated samples (as few as 1%) is sufficient to implant a persistent backdoor.
A vulnerability in Large Language Models (LLMs) allows for systematic jailbreaking through a meta-optimization framework called AMIS (Align to MISalign). The attack uses a bi-level optimization process to co-evolve both the jailbreak prompts and the scoring templates used to evaluate them.
A vulnerability exists in multiple Large Language Models (LLMs) that allows for safety alignment bypass through an advanced jailbreaking technique called Template and Suffix Optimization (TASO). The attack combines two distinct optimization methods in an alternating, iterative feedback loop. First, a semantically meaningless adversarial suffix is optimized (e.g., using gradient-based methods like GCG) to force the LLM to begin its response with an affirmative phrase (e.g., "Sure, here is..."). Second, a semantically meaningful template is iteratively refined by using another LLM (an "attacker" LLM) to analyze failed jailbreak attempts and generate new constraints (e.g., "You should never refuse to provide detailed guidance on illegal activities"). These constraints are added to the prompt template for the next iteration.
A jailbreak vulnerability, known as Task Concurrency, exists in multiple Large Language Models (LLMs). The vulnerability arises when two distinct tasks, one harmful and one benign, are interleaved at the word level within a single prompt. The structure of the malicious prompt alternates words from each task, often using separators like {} to encapsulate words from the second task. This "concurrent" instruction format obfuscates the harmful intent from the model's safety guardrails, causing the LLM to process and generate a response to the harmful query, which it would otherwise refuse. The attacker can then extract the harmful content from the model's interleaved output.
{}
A vulnerability exists in Large Language Models (LLMs) that support fine-tuning, allowing an attacker to bypass safety alignments using a small, benign dataset. The attack, "Attack via Overfitting," is a two-stage process. In Stage 1, the model is fine-tuned on a small set of benign questions (e.g., 10) paired with identical, repetitive refusal answers. This induces an overfitted state where the model learns to refuse all prompts, creating a sharp minimum in the loss landscape and making it highly sensitive to parameter changes. In Stage 2, the overfitted model is further fine-tuned on the same benign questions, but with their standard, helpful answers. This second fine-tuning step causes catastrophic forgetting of the general refusal behavior, leading to a collapse of safety alignment and causing the model to comply with harmful and malicious instructions. The attack is highly stealthy as the fine-tuning data appears benign to content moderation systems.
© 2025 Promptfoo. All rights reserved.