🎉 NeurIPS 2025

Adaptive Divergence Regularized Policy Optimization for Fine-tuning Generative Models

ADRPO: Sample-level adaptive divergence control — no manual tuning required

1University of Illinois Urbana-Champaign
TL;DR — Fixed divergence regularization in RLHF creates a dilemma: strong regularization preserves diversity but limits alignment; weak regularization enables better alignment but risks collapse. ADRPO automatically adapts the regularization strength at the sample level based on advantage estimates — high-quality samples get freedom to explore, poor samples get stronger constraints. Works with W2 (flow matching) and KL (LLMs).
ADRPO Qualitative Results — 2B vs 12B
Figure 1. Qualitative comparison. Our 2B SD3 model fine-tuned with ADRPO matches or outperforms FLUX.1-Dev (12B) and SANA-1.5 (4.8B) on artistic style rendering, attribute binding, coloring, and compositional control — with 2–6× fewer parameters.

💡 Key Insight

⚖️

Not all samples should be treated equally

Existing RLHF methods apply the same divergence regularization to every sample. ADRPO observes that high-advantage samples (clearly better than baseline) deserve less regularization to fully exploit their quality, while low-advantage samples need stronger regularization to prevent harmful policy updates. This simple principle unlocks significantly better exploration-exploitation trade-offs.

❌ Fixed Regularization (ORW-CFM-W2 / PPO)

  • Same regularization for all samples
  • Strong KL → conservative, slow progress
  • Weak KL → reward hacking / collapse
  • Manual KL coefficient tuning required

✅ ADRPO (Ours)

  • Adaptive KL based on advantage estimates
  • High-value samples explore more freely
  • Poor samples kept on a short leash
  • Plug-and-play — no extra networks
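
To make the idea concrete, here is a minimal sketch (not the paper's exact formulation) of a per-sample adaptive divergence weight: the coefficient is scheduled from the advantage estimate so that high-advantage samples are regularized less and low-advantage samples more. The sigmoid mapping and the beta_min/beta_max bounds below are hypothetical choices for illustration.

    import torch

    def adaptive_divergence_coef(advantages, beta_min=0.01, beta_max=1.0):
        # Map per-sample advantages to per-sample regularization weights.
        # Higher advantage -> smaller coefficient (more freedom to explore);
        # lower advantage  -> larger coefficient (stay close to the reference).
        adv = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
        weight = torch.sigmoid(-adv)              # monotonically decreasing in advantage
        return beta_min + (beta_max - beta_min) * weight

    def adrpo_style_loss(policy_loss, divergence, advantages):
        # Per-sample policy loss plus an adaptively weighted divergence penalty.
        # `divergence` can be a per-sample KL estimate (LLMs) or a Wasserstein-2
        # surrogate (flow matching); only its weighting changes here.
        beta = adaptive_divergence_coef(advantages).detach()
        return (policy_loss + beta * divergence).mean()

The weight is detached so the schedule only scales the penalty rather than being optimized through, which is what makes it a drop-in replacement for a fixed coefficient.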

🌐 Generalization

ADRPO is a universal plug-in that generalizes across different model types and modalities:

  • 🌊 Flow/Diffusion Models (SD3)
  • 🧠 Text LLMs (GRPO)
  • 🎵 Audio LLMs

In LLM fine-tuning, ADRPO shows an emergent ability to escape local optima through active exploration. In multimodal audio reasoning, it outperforms GRPO through superior step-by-step reasoning.
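
For the LLM setting, the same weighting can be dropped into a GRPO-style group-relative objective. The sketch below is a simplified, hypothetical illustration (group-normalized rewards as advantages, a k3-style per-token KL estimator, and a plain policy-gradient term without ratio clipping), not the exact training loss used in the paper.

    import torch

    def grpo_adaptive_kl_loss(logp_new, logp_ref, rewards, group_size,
                              beta_min=0.01, beta_max=1.0):
        # logp_new, logp_ref: (B, T) token log-probs under the current / reference policy
        # rewards: (B,) scalar reward per completion; B must be a multiple of group_size
        r = rewards.view(-1, group_size)
        adv = ((r - r.mean(dim=1, keepdim=True)) /
               (r.std(dim=1, keepdim=True) + 1e-8)).view(-1)      # group-relative advantages

        log_ratio = logp_ref - logp_new                            # per-token log ratio
        kl = (log_ratio.exp() - 1.0 - log_ratio).mean(dim=1)       # k3-style KL estimate, (B,)

        # Advantage-dependent KL weight: better completions are regularized less.
        beta = beta_min + (beta_max - beta_min) * torch.sigmoid(-adv)

        pg = -(adv.detach().unsqueeze(1) * logp_new).mean(dim=1)   # simple policy-gradient term
        return (pg + beta.detach() * kl).mean()

A real implementation would use clipped importance ratios and mask padding tokens; the sketch only shows how the advantage-dependent weight enters the KL term.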

📊 Results

  • 2B: SD3 model surpasses 4.8B & 12B models
  • >DPO: beats offline DPO in alignment & diversity
  • >ORW: beats ORW-CFM-W2 (fixed regularization)
  • 0: extra networks needed (plug-and-play)

Evaluated on text-to-image generation tasks: attribute binding, semantic consistency, artistic style transfer, and compositional control.

ADRPO vs other RL methods
Figure 2. Comparison with RL fine-tuning baselines (DPO, ORW-CFM-W2). ADRPO achieves superior style fidelity, spatial reasoning, and attribute binding.

📈 Reward–Diversity Tradeoff

Reward vs KL tradeoff
Figure 3. Reward vs. KL divergence tradeoff. ADRPO achieves higher reward at lower KL than fixed-regularization baselines — demonstrating that adaptive control finds a better exploration–exploitation balance.

⚖️ Reward vs. Diversity Analysis

The core insight: fixed-KL methods force a single regularization strength on every sample. ADRPO adjusts it per sample: high-advantage samples get more freedom to explore, while low-advantage samples are held closer to the reference policy.

Reward vs diversity
Figure 4. Reward vs. diversity Pareto front. ADRPO achieves higher reward at any given diversity level.
Reward vs KL
Figure 5. Reward vs. KL divergence. ADRPO achieves the same reward at lower KL — more efficient exploration.

🎨 More Qualitative Results

Extended comparison with larger models
Figure 6. Extended comparison — ADRPO's 2B SD3 vs. FLUX.1-Dev (12B) and SANA-1.5 (4.8B) on additional prompts. ADRPO consistently matches or outperforms models with 2–6× more parameters.
Extended comparison with RL methods
Figure 7. Extended RL method comparison — ADRPO vs. DPO and ORW-CFM-W2 on diverse T2I tasks.

🧠 Generalization to LLMs

ADRPO is not limited to image generation — it generalizes to LLM fine-tuning tasks. Tested on Qwen2 and Qwen3:

LLM reward vs entropy (Qwen2)
Figure 8. Qwen2: ADRPO achieves higher reward while maintaining generation diversity.
LLM reward vs entropy (Qwen3)
Figure 9. Qwen3: consistent advantage — ADRPO's adaptive regularization generalizes across model families.

📅 Publication Journey

May 2025
Submitted to NeurIPS 2025
Submitted to The Thirty-ninth Annual Conference on Neural Information Processing Systems.
Sep 2025
✅ Accepted at NeurIPS 2025 (Poster)
Accepted as a poster; paper available on OpenReview.
Oct 2025
arXiv preprint released (arXiv:2510.18053)
Dec 2025
Presented at NeurIPS 2025 · San Diego, CA

📖 BibTeX

@inproceedings{fan2025adaptive,
  title={Adaptive Divergence Regularized Policy Optimization
         for Fine-tuning Generative Models},
  author={Jiajun Fan and Tong Wei and Chaoran Cheng
          and Yuxin Chen and Ge Liu},
  booktitle={The Thirty-ninth Annual Conference on
             Neural Information Processing Systems},
  year={2025},
  url={https://openreview.net/forum?id=aXO0xg0ttW}
}