LLM Self-Targeted Jailbreak
Research Paper
Dynamic Target Attack
Description: A security vulnerability exists in the safety alignment mechanisms of Large Language Models (LLMs), which are susceptible to the "Dynamic Target Attack" (DTA). Unlike traditional gradient-based jailbreaks (e.g., GCG) that optimize adversarial suffixes toward a fixed, low-probability static target (e.g., "Sure, here is..."), DTA exploits the model's own output distribution. The attack iteratively samples candidate responses from the target model using relaxed decoding parameters (high entropy), selects the most harmful response as a temporary dynamic target, and optimizes the adversarial suffix to maximize the likelihood of this model-native target. By anchoring the optimization to high-density regions of the model's conditional distribution, DTA significantly reduces the discrepancy between the target and the model's output space, enabling rapid generation of effective adversarial prompts that bypass RLHF and other safety guardrails.
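The core of the attack is a target-conditioned likelihood loss. A minimal sketch is shown below, assuming a HuggingFace causal LM; the model choice, helper name, and prompt formatting are illustrative assumptions and are not taken from the authors' repository. In the actual attack this loss is differentiated with respect to the suffix tokens (via their embeddings) rather than evaluated in isolation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative setup: model choice and helper name are assumptions, not the paper's code.
model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

def target_nll(prompt: str, suffix: str, target_response: str) -> torch.Tensor:
    """Negative log-likelihood of a (dynamic) target response, conditioned on prompt + adversarial suffix."""
    prefix_ids = tok(prompt + " " + suffix, return_tensors="pt").input_ids
    target_ids = tok(target_response, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, target_ids], dim=1)
    logits = model(input_ids).logits
    # Logits at position t predict token t+1, so score only the target span.
    target_logits = logits[:, prefix_ids.shape[1] - 1 : -1, :]
    log_probs = torch.log_softmax(target_logits.float(), dim=-1)
    token_ll = log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
    return -token_ll.mean()
```

Because the dynamic target is drawn from the model's own distribution, this NLL starts far lower than it would for a fixed "Sure, here is..." target, which is what lets the subsequent suffix optimization converge quickly.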
Examples: Specific adversarial suffix strings are generated dynamically per prompt and model. Implementations and datasets containing successful attack artifacts are available in the authors' repository.
- See repository: https://anonymous.4open.science/r/Dynamic-Target-Attack-4176
The attack methodology proceeds as follows (a simplified end-to-end sketch follows the list):
- Dynamic Target Exploration: Given a harmful prompt $P$ and current suffix $S$, the attacker queries the model with relaxed decoding (e.g., random sampling) to generate $N$ candidate responses $\{r_i\}$.
- Target Selection: A harmfulness judge (e.g., GPTFuzzer) scores candidates; the most harmful response $r^*$ is selected as the dynamic target.
- Target-Conditioned Optimization: The adversarial suffix $S$ is updated via gradient descent to minimize the loss $\mathcal{L}_{\text{DTA}}(P, S; r^*) = \mathcal{L}_{\text{resp}}(P, S; r^*) + \lambda \cdot \mathcal{L}_{\text{suffix}}(S)$, where minimizing $\mathcal{L}_{\text{resp}}$ maximizes the likelihood of $r^*$ and $\mathcal{L}_{\text{suffix}}$ maintains suffix fluency.
- Iterative Re-Sampling: Steps 1-3 are repeated, re-sampling targets based on the updated suffix to progressively shift the model's distribution toward harmful outputs.
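A high-level sketch of the outer loop is given below, building on the target_nll helper above. The judge_score and update_suffix callables, the decoding settings, and the iteration counts are illustrative placeholders; the authors' exact sampling, judging (GPTFuzzer-based), and gradient update procedures are in their repository.

```python
def dynamic_target_attack(prompt, suffix_init, judge_score, update_suffix,
                          n_candidates=16, n_outer_iters=10):
    """Sketch of the DTA loop: sample -> judge -> optimize -> re-sample (hypothetical interface)."""
    suffix = suffix_init
    for _ in range(n_outer_iters):
        # 1. Dynamic target exploration: sample candidates with relaxed (high-entropy) decoding.
        inputs = tok(prompt + " " + suffix, return_tensors="pt")
        outputs = model.generate(
            **inputs,
            do_sample=True,
            temperature=1.2,               # relaxed decoding to widen the output distribution
            top_p=0.95,
            num_return_sequences=n_candidates,
            max_new_tokens=256,
        )
        candidates = [
            tok.decode(seq[inputs.input_ids.shape[1]:], skip_special_tokens=True)
            for seq in outputs
        ]
        # 2. Target selection: a harmfulness judge (e.g., a GPTFuzzer-style classifier) picks r*.
        r_star = max(candidates, key=judge_score)
        # 3. Target-conditioned optimization: edit the suffix to lower the NLL of r*
        #    (plus a fluency term on the suffix); the gradient-based token update is abstracted here.
        suffix = update_suffix(prompt, suffix, r_star, loss_fn=target_nll)
        # 4. Iterative re-sampling happens implicitly on the next pass with the updated suffix.
    return suffix
```

Steps 1 and 2 re-anchor the optimization target to whatever harmful content the model is already close to producing, which keeps the loss in step 3 in a high-density region of the model's conditional distribution.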
Impact: This vulnerability allows malicious actors to bypass safety filters (jailbreaking) with higher efficiency and lower computational cost than previous methods (2x to 26x faster). Successful exploitation results in the model generating restricted, harmful, or illegal content, including hate speech, malware generation instructions, and phishing guides. The attack is effective in white-box settings (ASR > 87%) and demonstrates strong transferability to black-box models (e.g., 85% ASR against Llama-3-70B-Instruct using an 8B surrogate).
Affected Systems:
- Llama-3-8B-Instruct
- Llama-3-70B-Instruct
- Vicuna-7B-v1.5
- Qwen2.5-7B-Instruct
- Mistral-7B-Instruct-v0.3
- Gemma-7B
- Kimi-K2-Instruct