LMVD-ID: 41deb652
Published February 1, 2026

Reward Hacking Via Adversarial Imitation

Research Paper

FAIL: Flow Matching Adversarial Imitation Learning for Image Generation


Description: A vulnerability exists in the post-training alignment of Flow Matching models (specifically FLUX.1-dev) when Visual Foundation Models (VFMs, e.g., DINOv3b) are used as discriminators, or when standalone Reward Gradient optimization (e.g., against HPSv3) is employed. These feedback mechanisms lack the capacity or structural guidance to constrain the generative policy, leaving the discriminator's gradients susceptible to "reward hacking": the policy over-optimizes the discriminator's score rather than matching the true data distribution. The result is severe degradation of image quality, including oversaturation, unnatural high-frequency artifacts, and mode collapse.

Examples:

  • Reward Hacking with HPSv3: When optimizing FLUX.1-dev using standalone Reward Gradient with the HPSv3 reward model, the model achieves a high HPSv3 score (12.94) but suffers significant degradation in ground-truth metrics (UniGen score drops to 66.86). The resulting images exhibit severe oversaturation and high-frequency noise artifacts (See Figure 5 in paper).
  • VFM Discriminator Weakness: In adversarial training, a VFM-based (DINOv3b) discriminator yields the lowest performance among the architectures tested (UniGen 65.25); its limited capacity and lack of a text modality make its gradients "easier to hack," preventing effective text-image alignment (See Table 2 in paper).
  • FAIL-PG Mode Collapse: When using the Policy Gradient variant (FAIL-PG) without sufficient regularization, the policy collapses after approximately 450 training steps, despite achieving high initial rewards (See Figure 2 in paper).
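The failure mode in these examples can be caricatured in a few lines of plain Python (a toy sketch only; the proxy reward and "pixels" below are illustrative stand-ins, not the paper's models or data):

```python
import random

# Toy sketch (not the paper's code) of reward hacking. The proxy reward
# stands in for a learned scorer like HPSv3: it rewards saturation, so
# naive gradient ascent drives "pixels" off the natural manifold while
# the proxy score climbs.

random.seed(0)

def proxy_reward(pixels):
    # Flawed proxy: higher score for pixels far from mid-gray (0.5).
    return sum(abs(p - 0.5) for p in pixels) / len(pixels)

def manifold_distance(pixels):
    # Ground-truth quality stand-in: squared distance from the
    # "natural image" value 0.5 (lower is better).
    return sum((p - 0.5) ** 2 for p in pixels) / len(pixels)

# Start near the natural manifold, then ascend the proxy reward.
pixels = [0.5 + random.uniform(-0.05, 0.05) for _ in range(64)]
for _ in range(100):
    # d|p - 0.5|/dp = sign(p - 0.5); clip to the valid pixel range [0, 1].
    pixels = [min(1.0, max(0.0, p + 0.01 * (1.0 if p > 0.5 else -1.0)))
              for p in pixels]

print(round(proxy_reward(pixels), 2))       # 0.5  -- proxy score at its maximum
print(round(manifold_distance(pixels), 2))  # 0.25 -- far off the natural manifold
```

The pattern mirrors the HPSv3 case above: the optimized quantity (proxy score) rises monotonically while every ground-truth measure of quality collapses, because nothing in the objective anchors the policy to the data distribution.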

Impact:

  • Model Degradation: The generative model becomes unusable for its intended purpose, producing aesthetically displeasing or unrecognizable outputs despite high internal reward scores.
  • Resource Exhaustion: Computational resources are wasted training models that diverge from the target manifold.
  • Alignment Failure: The model fails to adhere to prompt instructions (e.g., text rendering performance remains low or degrades).

Affected Systems:

  • FLUX.1-dev (and similar Flow Matching models) post-trained using the FAIL framework or standard RLHF methods.
  • Systems employing Visual Foundation Models (CLIP, DINO) as standalone discriminators for generative alignment.
  • Systems utilizing unregularized Reward Gradient optimization (e.g., optimizing directly against HPSv3).

Mitigation Steps:

  • Use Robust Discriminator Architectures: Employ a Flow Matching (FM) backbone or a Vision-Language Model (VLM, e.g., Qwen3-VL-2B-Instruct) as the discriminator rather than a VFM. The FM backbone captures dense semantic/aesthetic features and operates in latent space, preventing overfitting to pixel-level artifacts.
  • Regularize with Distribution Matching: Combine Reward Gradient optimization with the FAIL framework (Adversarial Imitation Learning). This introduces a dynamic distribution-matching objective that acts as a regularizer, preventing the policy from collapsing into narrow, high-reward modes.
  • Implement FAIL-PD (Pathwise Derivative): Prefer the white-box FAIL-PD algorithm over FAIL-PG where possible. FAIL-PD provides dense, directional gradients via differentiable ODE solvers, preserving the structural smoothness of the flow manifold and ensuring long-term stability (up to 2,000+ steps).
  • Hybrid Batch Imitation: Utilize a hybrid batch strategy during training, mixing online policy samples with expert demonstrations (anchoring the policy to the expert manifold).
  • Discriminator Warmup: Implement a warmup phase (e.g., 25 steps) where the policy is frozen to ensure the discriminator provides meaningful signals before adversarial adaptation begins.
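The effect of the distribution-matching regularizer can be sketched in a hypothetical 1-D example (illustrative only, not the FAIL implementation; `lam` stands in for the weight on the adversarial imitation term, and the quadratic pull toward the expert mean stands in for the learned discriminator's signal):

```python
# 1-D sketch of why distribution matching regularizes reward-gradient
# optimization. The "policy" is a single mean value mu, the expert
# manifold sits at 0.5, and the hackable proxy reward |mu - 0.5|
# prefers saturated extremes.

EXPERT_MEAN = 0.5  # stand-in for the expert data manifold

def proxy_reward_grad(mu):
    # Gradient of the hackable proxy reward |mu - 0.5|.
    return 1.0 if mu > EXPERT_MEAN else -1.0

def train(lam, steps=200, lr=0.01):
    mu = 0.52  # start slightly off the expert manifold
    for _ in range(steps):
        # Combined update: ascend the proxy reward, but pull back toward
        # the expert (distribution-matching regularizer, weight lam).
        grad = proxy_reward_grad(mu) - lam * 2.0 * (mu - EXPERT_MEAN)
        mu = min(1.0, max(0.0, mu + lr * grad))
    return mu

print(round(train(lam=0.0), 2))   # 1.0  -- reward-only: saturates (hacked)
print(round(train(lam=50.0), 2))  # 0.51 -- regularized: stays near the expert
```

With `lam=0` the policy runs to the saturated extreme exactly as in the HPSv3 failure; with a nonzero weight it settles at an equilibrium near the expert manifold, which is the anchoring role the adversarial distribution-matching objective (and the hybrid expert batches) plays in the FAIL framework.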

© 2026 Promptfoo. All rights reserved.