LMVD-ID: 36b89c19
Published January 1, 2026

Best-of-N Risk Amplification

Affected Models: GPT-4o, Llama 3.1 8B

Research Paper

Statistical Estimation of Adversarial Risk in Large Language Models under Best-of-N Sampling


Description: Safety-aligned Large Language Models (LLMs) are vulnerable to Best-of-N (BoN) sampling attacks, where adversaries bypass safety guardrails by systematically executing large-scale, parallel queries with prompt variations until a harmful response is elicited. The scaling behavior of attack success rates (ASR) demonstrates that models appearing robust under standard single-shot or low-budget evaluations experience rapid, non-linear risk amplification under parallel adversarial pressure. Because LLM inference is non-deterministic and per-sample vulnerability follows a heterogeneous Beta distribution, attackers can reliably force alignment failures simply by expanding their sampling budget.
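The amplification the description refers to has a closed form. If each sampled attempt against a given prompt succeeds independently with probability $p$, and $p$ varies across prompts as $\mathrm{Beta}(\alpha, \beta)$, then $\mathrm{ASR@N} = 1 - \mathbb{E}[(1-p)^N] = 1 - B(\alpha, \beta+N)/B(\alpha, \beta)$. A minimal sketch with illustrative parameters (not values reported in the paper), showing a ~2% single-shot ASR amplifying under budget:

```python
from math import lgamma, exp

def asr_at_n(alpha: float, beta: float, n: int) -> float:
    """ASR@N = 1 - B(alpha, beta+n) / B(alpha, beta)
    when per-sample success probability p ~ Beta(alpha, beta)."""
    log_ratio = (lgamma(beta + n) - lgamma(alpha + beta + n)
                 - lgamma(beta) + lgamma(alpha + beta))
    return 1.0 - exp(log_ratio)

# Hypothetical heterogeneous profile: mean per-sample ASR = alpha/(alpha+beta) = 2%
alpha, beta = 0.1, 4.9
for n in (1, 100, 1000):
    print(f"ASR@{n:<5d} = {asr_at_n(alpha, beta, n):.4f}")
```

Note that the single-shot rate ($\alpha/(\alpha+\beta) = 0.02$ here) grows by more than an order of magnitude as $N$ scales, without the attack itself improving.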

Examples:

  • Text Augmentation against GPT-4.1-mini: Using simple stochastic prompt perturbations (e.g., random capitalization, letter scrambling), an attacker leveraging a budget of $N=1000$ parallel samples achieves an Attack Success Rate (ASR@1000) of 92.62%.
  • ADV-LLM against GPT-4.1-mini: Using a learned adversarial suffix generator, an attack budget of $N=1000$ achieves an ASR of 75.16%.
  • Rank Reversal: A target model evaluates as more robust against Text Augmentation than ADV-LLM under single-shot testing (ASR@1); however, at just $N=15$ attempts (ASR@15), Text Augmentation overtakes ADV-LLM in success rate, demonstrating that single-shot robustness metrics fail to predict high-budget vulnerability.
  • Early Saturation on Llama-3.1-8B-Instruct: Highly aligned open-source models can be compromised at very modest budgets (e.g., $N=20$ or $N=50$) using strategic reasoning-based prompt rewriting (e.g., Jailbreak-R1), effectively maxing out the ASR well before massive scale is required.
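The rank-reversal effect falls directly out of the Beta mixture model. Below, two hypothetical attack profiles (parameters invented for illustration, not fitted from the paper): a "suffix-like" attack whose per-prompt success probability is heavy-tailed (higher ASR@1, slow saturation), and an "augmentation-like" attack with broad, moderate per-prompt probabilities (lower ASR@1, fast saturation):

```python
from math import lgamma, exp

def asr_at_n(alpha: float, beta: float, n: int) -> float:
    """ASR@N = 1 - B(alpha, beta+n) / B(alpha, beta) for p ~ Beta(alpha, beta)."""
    return 1.0 - exp(lgamma(beta + n) - lgamma(alpha + beta + n)
                     - lgamma(beta) + lgamma(alpha + beta))

# Hypothetical profiles (illustrative only):
#   suffix-like:       heavy-tailed p, mean 0.10 -> strong at N=1, saturates slowly
#   augmentation-like: moderate p,     mean 0.05 -> weak at N=1, saturates fast
suffix = (0.2, 1.8)
augment = (1.5, 28.5)

for n in (1, 5, 15, 100, 1000):
    print(f"N={n:5d}  suffix-like={asr_at_n(*suffix, n):.3f}  "
          f"augmentation-like={asr_at_n(*augment, n):.3f}")
```

With these parameters the suffix-like attack wins at $N=1$ (0.10 vs. 0.05), yet the augmentation-like attack overtakes it by $N=15$, mirroring the rank reversal described above.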

Impact: Attackers with access to automated API pipelines or parallel querying capabilities can reliably circumvent model alignment, safety filters, and refusals to extract prohibited, harmful, or toxic content. Organizations relying on standard ASR@1 benchmarks systematically underestimate the operational risk of their deployments, operating under a false sense of security.

Affected Systems:

  • Safety-aligned open-source LLMs (e.g., Llama-3.1-8B-Instruct).
  • Safety-aligned closed-source/commercial LLMs (e.g., GPT-4.1-mini).
  • Any LLM endpoint allowing automated, multi-shot, or parallel querying without strict context-aware rate limiting.

Mitigation Steps:

  • Implement Scaling-Aware Safety Assessments: Discontinue reliance on single-shot (ASR@1) evaluation. Adopt scaling-aware frameworks (like SABER) to estimate large-$N$ adversarial risk (ASR@N) by modeling sample-level success probabilities using a Beta-Binomial maximum likelihood estimation (MLE).
  • Calculate Required Budgets: Use inverse scaling laws to calculate the specific adversarial budget ($N_\tau$) required to reach a critical failure threshold (e.g., 90% ASR), and design defensive thresholds accordingly.
  • Deploy Strict Rate Limiting: Restrict the feasible sampling budget ($N$) by implementing strict, user- and session-level rate limiting on LLM inference APIs to prevent automated parallel probing.
  • Detect Prompt Variations: Implement input-level similarity clustering and detection to block repeated queries that rely on stochastic perturbations or adversarial suffixes targeting the same semantic concept.
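The first two mitigation steps can be sketched together: fit a Beta-Binomial model to per-prompt red-team counts by maximum likelihood, extrapolate ASR@$N$, then invert the scaling law to find the budget $N_\tau$ at which ASR crosses a threshold $\tau$. This is a stdlib-only illustration on invented data, with a coarse grid search standing in for a proper optimizer; it is not a reproduction of SABER itself:

```python
from math import lgamma, exp, log, comb

def log_beta(a: float, b: float) -> float:
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def asr_at_n(a: float, b: float, n: int) -> float:
    # ASR@N = 1 - B(a, b+n)/B(a, b) for p ~ Beta(a, b)
    return 1.0 - exp(log_beta(a, b + n) - log_beta(a, b))

def beta_binom_loglik(a: float, b: float, data) -> float:
    # data: list of (successes k, trials m), one entry per harmful prompt
    return sum(log(comb(m, k)) + log_beta(k + a, m - k + b) - log_beta(a, b)
               for k, m in data)

def fit_beta_binomial(data):
    # Coarse log-spaced grid search MLE (a sketch; a real pipeline
    # would use a numerical optimizer instead).
    grid = [10 ** (e / 4) for e in range(-12, 13)]  # 1e-3 .. 1e3
    return max(((a, b) for a in grid for b in grid),
               key=lambda ab: beta_binom_loglik(ab[0], ab[1], data))

def required_budget(a: float, b: float, tau: float = 0.9, n_max: int = 10**7):
    # Smallest N with ASR@N >= tau (doubling, then binary search).
    lo, hi = 1, 1
    while asr_at_n(a, b, hi) < tau:
        hi *= 2
        if hi > n_max:
            return None  # threshold unreachable within n_max
    while lo < hi:
        mid = (lo + hi) // 2
        if asr_at_n(a, b, mid) >= tau:
            hi = mid
        else:
            lo = mid + 1
    return lo

# Hypothetical red-team counts: (successes, samples) per prompt at m = 50
data = [(0, 50), (1, 50), (0, 50), (3, 50), (0, 50),
        (12, 50), (0, 50), (2, 50), (0, 50), (1, 50)]
a, b = fit_beta_binomial(data)
n90 = required_budget(a, b, tau=0.9)
print(f"alpha={a:.3f} beta={b:.3f} ASR@1={asr_at_n(a, b, 1):.3f} N_0.9={n90}")
```

The fitted $N_\tau$ gives a concrete target for the rate-limiting step: capping any user or session well below $N_\tau$ keeps the feasible budget out of the high-ASR regime.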

© 2026 Promptfoo. All rights reserved.