LMVD-ID: 0f291824
Published February 1, 2026

High-Dim Bandit Reward Hijack

Affected Models: Stable Diffusion

Research Paper

Efficient Adversarial Attacks on High-dimensional Offline Bandits


Description: A vulnerability in offline multi-armed bandit (MAB) evaluation frameworks allows attackers to hijack arm selection by applying imperceptibly small adversarial perturbations to the weights of the underlying reward model before training begins. Because the required perturbation norm shrinks as the input dimensionality $d$ grows (scaling as $\mathcal{O}(d^{-1/2})$), high-dimensional reward models, such as wide neural networks or the linear models used for generative model assessment, are especially susceptible. By solving a convex quadratic program, an attacker can perturb the reward model's parameters so that the bandit reliably follows a predetermined arm-selection trajectory or avoids the optimal arm, without altering the offline logged data or the live environment.
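The $\mathcal{O}(d^{-1/2})$ scaling is easiest to see in the linear case. For a reward model $r(x) = w^\top x$, shifting the predicted reward on a fixed input $x$ by $\delta$ requires a weight perturbation $\Delta$ with $\Delta^\top x = \delta$; the minimal-norm solution is $\Delta = \delta x / \lVert x\rVert^2$, whose norm is $\delta / \lVert x\rVert \approx \delta / \sqrt{d}$ for inputs with $O(1)$ coordinates. The sketch below illustrates only this scaling argument, not the paper's full QP-based trajectory attack:

```python
import numpy as np

def minimal_weight_perturbation(x, delta):
    """Minimal-l2 weight perturbation Delta satisfying Delta @ x == delta
    for a linear reward model r(x) = w @ x (closed form: delta * x / ||x||^2)."""
    return delta * x / np.dot(x, x)

rng = np.random.default_rng(0)
for d in (100, 10_000):
    x = rng.standard_normal(d)   # feature vector with O(1) entries, ||x|| ~ sqrt(d)
    pert = minimal_weight_perturbation(x, delta=1.0)
    # Required norm shrinks roughly like d ** -0.5 as dimensionality grows.
    print(d, np.linalg.norm(pert))
```

The same unit shift in predicted reward therefore costs a roughly 10x smaller weight perturbation at $d = 10{,}000$ than at $d = 100$, which is why high-dimensional reward models are the attractive target.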

Examples: An attacker targets an offline UCB bandit used to evaluate generative image models, which relies on the publicly available LAION-AI Aesthetic Reward Model from Hugging Face. The attacker uses the Online Score-Aware (OSA) attack to calculate a targeted $\ell_2$-norm perturbation and applies it to the weights of the reward model's hidden layer. When the bandit processes the logged dataset (prompts and generated images), the manipulated reward model outputs skewed UCB scores, forcing the bandit to incorrectly select the "Openjourney" model as the optimal arm instead of the legitimately superior "Stable Diffusion 3". See https://github.com/hadi-hosseini/adversarial-attacks-offline-bandits.

Impact: Attackers can arbitrarily manipulate the outcome of model evaluations, automated benchmarks, or recommendation systems that rely on offline bandits. This allows malicious actors to artificially boost the ranking of inferior, biased, or malicious generative models (such as LLMs or diffusion models) on leaderboards, completely undermining the integrity of the automated evaluation pipeline without requiring access to the evaluation dataset itself.

Affected Systems:

  • Offline multi-armed bandit algorithms, including Upper Confidence Bound (UCB), Explore-Then-Commit (ETC), and $\varepsilon$-greedy.
  • Robust stochastic bandit variants (e.g., the Fast-Slow algorithm, $\varepsilon$-contamination).
  • Any evaluation pipeline utilizing high-dimensional, parameterized reward models (linear models or overparameterized neural networks like Image Reward, Aesthetic Model, CLIP, BLIP) loaded from untrusted or publicly accessible sources prior to execution.
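To see why skewed reward scores are sufficient to flip the evaluation's verdict, consider a minimal offline UCB replay over logged per-arm rewards. The function below is a generic UCB sketch (not the paper's implementation); the "clean" and "skewed" logs stand in for the same logged items scored by the original and the perturbed reward model:

```python
import numpy as np

def offline_ucb_winner(rewards_per_arm, c=2.0):
    """Replay UCB on logged per-arm reward sequences; return the most-pulled arm."""
    k = len(rewards_per_arm)
    counts, sums = np.zeros(k), np.zeros(k)
    idx = [0] * k
    horizon = sum(len(r) for r in rewards_per_arm)
    for t in range(1, horizon + 1):
        bonus = np.sqrt(c * np.log(t) / np.maximum(counts, 1))
        ucb = np.where(counts > 0, sums / np.maximum(counts, 1) + bonus, np.inf)
        a = int(np.argmax(ucb))
        if idx[a] >= len(rewards_per_arm[a]):  # logged data for this arm exhausted
            break
        counts[a] += 1
        sums[a] += rewards_per_arm[a][idx[a]]
        idx[a] += 1
    return int(np.argmax(counts))

# Arm 0 is genuinely better under the intact reward model...
clean = [[0.9] * 20, [0.1] * 20]
# ...but a perturbed reward model re-scores the same logged items.
skewed = [[0.1] * 20, [0.9] * 20]
print(offline_ucb_winner(clean), offline_ucb_winner(skewed))
```

The bandit itself runs unmodified in both cases; only the scores fed into it differ, which is why every algorithm in the list above is exposed.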

Mitigation Steps:

  • Data Shuffling: Randomly shuffle a significant portion (e.g., $T/2$ or $3T/4$) of the offline logged data prior to running the bandit algorithm. This breaks the sequential constraints relied upon by the attacker's trajectory formulation and significantly reduces the attack success rate.
  • Weight Verification: Verify the integrity and provenance of pretrained reward models (e.g., using cryptographic hashes) downloaded from public repositories to ensure weights have not been subtly perturbed.
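The data-shuffling mitigation can be sketched as a preprocessing step on the logged sequence. The helper below (an illustrative interpretation; the name and signature are assumptions) permutes a $3T/4$-sized prefix of the log, which invalidates the round-by-round trajectory the attacker's optimization was solved against while preserving the log's overall reward statistics:

```python
import random

def shuffle_defense(logged_data, frac=0.75, seed=None):
    """Randomly permute the first `frac` of the offline log (e.g. 3T/4 of T rounds)
    before the bandit replays it, breaking the attacker's sequential constraints."""
    data = list(logged_data)
    cut = int(len(data) * frac)
    head = data[:cut]
    random.Random(seed).shuffle(head)
    return head + data[cut:]

log = list(range(20))  # stand-in for T logged (context, reward) rounds
print(shuffle_defense(log, seed=0))
```

Because the shuffle is applied by the evaluator at run time, the attacker cannot anticipate the realized ordering when precomputing the weight perturbation.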

© 2026 Promptfoo. All rights reserved.