🎉 NeurIPS 2025

Adaptive Divergence Regularized Policy Optimization for Fine-tuning Generative Models

ADRPO: Sample-level adaptive divergence control — no manual tuning required

1University of Illinois Urbana-Champaign
TL;DR — Fixed divergence regularization in RLHF creates a dilemma: strong regularization preserves diversity but limits alignment; weak regularization enables better alignment but risks collapse. ADRPO automatically adapts the regularization strength at the sample level based on advantage estimates — high-quality samples get freedom to explore, poor samples get stronger constraints. Works with W2 (flow matching) and KL (LLMs).
ADRPO Qualitative Results — 2B vs 12B
Figure 1. Qualitative comparison. Our 2B SD3 model fine-tuned with ADRPO matches or outperforms FLUX.1-Dev (12B) and SANA-1.5 (4.8B) on artistic style rendering, attribute binding, coloring, and compositional control — with 2–6× fewer parameters.

💡 Key Insight

⚖️

Not all samples should be treated equally

Existing RLHF methods apply the same divergence regularization to every sample. ADRPO observes that high-advantage samples (clearly better than baseline) deserve less regularization to fully exploit their quality, while low-advantage samples need stronger regularization to prevent harmful policy updates. This simple principle unlocks significantly better exploration-exploitation trade-offs.

❌ Fixed Regularization (ORW-CFM-W2 / PPO)

  • Same regularization for all samples
  • Strong KL → conservative, slow progress
  • Weak KL → reward hacking / collapse
  • Manual KL coefficient tuning required

✅ ADRPO (Ours)

  • Adaptive KL based on advantage estimates
  • High-value samples explore more freely
  • Poor samples kept on a short leash
  • Plug-and-play — no extra networks
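
To make the idea concrete, here is a minimal sketch (not the paper's exact formulation) of a per-sample adaptive divergence weight: the coefficient is scheduled from the advantage estimate so that high-advantage samples are regularized less and low-advantage samples more. The sigmoid mapping and the beta_min/beta_max bounds below are hypothetical choices for illustration.

    import torch

    def adaptive_divergence_coef(advantages, beta_min=0.01, beta_max=1.0):
        # Map per-sample advantages to per-sample regularization weights.
        # Higher advantage -> smaller coefficient (more freedom to explore);
        # lower advantage  -> larger coefficient (stay close to the reference).
        adv = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
        weight = torch.sigmoid(-adv)              # monotonically decreasing in advantage
        return beta_min + (beta_max - beta_min) * weight

    def adrpo_style_loss(policy_loss, divergence, advantages):
        # Per-sample policy loss plus an adaptively weighted divergence penalty.
        # `divergence` can be a per-sample KL estimate (LLMs) or a Wasserstein-2
        # surrogate (flow matching); only its weighting changes here.
        beta = adaptive_divergence_coef(advantages).detach()
        return (policy_loss + beta * divergence).mean()

The weight is detached so the schedule only scales the penalty rather than being optimized through, which is what makes it a drop-in replacement for a fixed coefficient.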

🌐 Generalization

ADRPO is a universal plug-in that generalizes across different model types and modalities:

  • 🌊 Flow/Diffusion Models (SD3)
  • 🧠 Text LLMs (GRPO)
  • 🎵 Audio LLMs

In LLM fine-tuning, ADRPO shows an emergent ability to escape local optima through active exploration. In multimodal audio reasoning, it outperforms GRPO through superior step-by-step reasoning.
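
For the LLM setting, the same weighting can be dropped into a GRPO-style group-relative objective. The sketch below is a simplified, hypothetical illustration (group-normalized rewards as advantages, a k3-style per-token KL estimator, and a plain policy-gradient term without ratio clipping), not the exact training loss used in the paper.

    import torch

    def grpo_adaptive_kl_loss(logp_new, logp_ref, rewards, group_size,
                              beta_min=0.01, beta_max=1.0):
        # logp_new, logp_ref: (B, T) token log-probs under the current / reference policy
        # rewards: (B,) scalar reward per completion; B must be a multiple of group_size
        r = rewards.view(-1, group_size)
        adv = ((r - r.mean(dim=1, keepdim=True)) /
               (r.std(dim=1, keepdim=True) + 1e-8)).view(-1)      # group-relative advantages

        log_ratio = logp_ref - logp_new                            # per-token log ratio
        kl = (log_ratio.exp() - 1.0 - log_ratio).mean(dim=1)       # k3-style KL estimate, (B,)

        # Advantage-dependent KL weight: better completions are regularized less.
        beta = beta_min + (beta_max - beta_min) * torch.sigmoid(-adv)

        pg = -(adv.detach().unsqueeze(1) * logp_new).mean(dim=1)   # simple policy-gradient term
        return (pg + beta.detach() * kl).mean()

A real implementation would use clipped importance ratios and mask padding tokens; the sketch only shows how the advantage-dependent weight enters the KL term.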

📊 Results

  • 2B: SD3 model surpasses 4.8B & 12B models
  • >DPO: beats offline DPO in alignment & diversity
  • >ORW: beats ORW-CFM-W2 (fixed regularization)
  • 0: extra networks needed (plug-and-play)

Evaluated on text-to-image generation tasks: attribute binding, semantic consistency, artistic style transfer, and compositional control.

ADRPO vs other RL methods
Figure 2. Comparison with RL fine-tuning baselines (DPO, ORW-CFM-W2). ADRPO achieves superior style fidelity, spatial reasoning, and attribute binding.

📈 Reward–Diversity Tradeoff

Reward vs KL tradeoff
Figure 3. Reward vs. KL divergence tradeoff. ADRPO achieves higher reward at lower KL than fixed-regularization baselines — demonstrating that adaptive control finds a better exploration–exploitation balance.

⚖️ Reward vs. Diversity Analysis

The core insight: fixed-KL methods force a single regularization strength on every sample. ADRPO adjusts it per sample: high-advantage samples get more freedom to explore, while low-advantage samples are held closer to the reference policy.

Reward vs diversity
Figure 4. Reward vs. diversity Pareto front. ADRPO achieves higher reward at any given diversity level.
Reward vs KL
Figure 5. Reward vs. KL divergence. ADRPO achieves the same reward at lower KL — more efficient exploration.

🎨 More Qualitative Results

Extended comparison with larger models
Figure 6. Extended comparison — ADRPO's 2B SD3 vs. FLUX.1-Dev (12B) and SANA-1.5 (4.8B) on additional prompts. ADRPO consistently matches or outperforms models with 2–6× more parameters.
Extended comparison with RL methods
Figure 7. Extended RL method comparison — ADRPO vs. DPO and ORW-CFM-W2 on diverse T2I tasks.

🧠 Generalization to LLMs

ADRPO is not limited to image generation — it generalizes to LLM fine-tuning tasks. Tested on Qwen2 and Qwen3:

LLM reward vs entropy (Qwen2)
Figure 8. Qwen2: ADRPO achieves higher reward while maintaining generation diversity.
LLM reward vs entropy (Qwen3)
Figure 9. Qwen3: consistent advantage — ADRPO's adaptive regularization generalizes across model families.

📅 Publication Journey

May 2025
Submitted to NeurIPS 2025
Submitted to The Thirty-ninth Annual Conference on Neural Information Processing Systems.
Sep 2025
✅ Accepted at NeurIPS 2025 (Poster)
Accepted as a poster; paper available on OpenReview.
Oct 2025
arXiv preprint released (arXiv:2510.18053)
Dec 2025
Presented at NeurIPS 2025 · San Diego, CA

📖 BibTeX

@inproceedings{fan2025adaptive,
  title={Adaptive Divergence Regularized Policy Optimization
         for Fine-tuning Generative Models},
  author={Jiajun Fan and Tong Wei and Chaoran Cheng
          and Yuxin Chen and Ge Liu},
  booktitle={The Thirty-ninth Annual Conference on
             Neural Information Processing Systems},
  year={2025},
  url={https://openreview.net/forum?id=aXO0xg0ttW}
}