โ† Jiajun Fan / Publications / ADRPO
🎉 NeurIPS 2025

Adaptive Divergence Regularized Policy Optimization for Fine-tuning Generative Models

ADRPO: Sample-level adaptive KL control, no manual tuning required

Jiajun Fan¹, Tong Wei¹, Chaoran Cheng¹, Yuxin Chen¹, Ge Liu¹
¹University of Illinois Urbana-Champaign
📄 Paper (OpenReview) · arXiv · 🏠 Author Homepage
TL;DR: Fixed KL regularization in RLHF creates a dilemma: strong regularization preserves diversity but limits alignment, while weak regularization enables better alignment but risks collapse. ADRPO automatically adapts the regularization strength at the sample level based on advantage estimates: high-quality samples get more freedom to explore, and poor samples get stronger constraints.

💡 Key Insight

⚖️ Not all samples should be treated equally

Existing RLHF methods apply the same divergence regularization to every sample. ADRPO instead adapts it per sample: high-advantage samples (clearly better than the baseline) receive weaker regularization so their quality can be fully exploited, while low-advantage samples receive stronger regularization to prevent harmful policy updates. This simple principle yields a significantly better exploration-exploitation trade-off.
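A minimal PyTorch sketch of the sample-level idea, for intuition only: the exponential weighting schedule and the names used here (adaptive_kl_coeff, base_beta, temperature) are illustrative assumptions, not the paper's exact rule.

import torch

def adaptive_kl_coeff(advantages: torch.Tensor,
                      base_beta: float = 0.1,
                      temperature: float = 1.0) -> torch.Tensor:
    """Per-sample KL weight: shrinks for high-advantage samples
    (more freedom to explore) and grows for low-advantage samples
    (stronger pull back toward the reference policy)."""
    # Normalize advantages within the batch so the schedule is scale-free.
    adv = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    return base_beta * torch.exp(-adv / temperature)

def adrpo_style_loss(logp_policy: torch.Tensor,
                     logp_ref: torch.Tensor,
                     advantages: torch.Tensor) -> torch.Tensor:
    """REINFORCE-style objective with a per-sample KL penalty."""
    beta = adaptive_kl_coeff(advantages).detach()   # weight acts as a gate, no gradient
    kl_est = logp_policy - logp_ref                 # simple per-sample KL estimate
    pg = -advantages.detach() * logp_policy         # policy-gradient term
    return (pg + beta * kl_est).mean()

Detaching the weight keeps adaptation as a pure gating signal: gradients still flow through the policy log-probabilities, but not through the regularization strength itself.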

โŒ Fixed KL (ORW-CFM-W2 / PPO)

  • Same regularization for all samples
  • Strong KL → conservative, slow progress
  • Weak KL → reward hacking / collapse
  • Manual KL coefficient tuning required

✅ ADRPO (Ours)

  • Adaptive KL based on advantage estimates
  • High-value samples explore more freely
  • Poor samples kept on a short leash
  • Plug-and-play: no extra networks

🌐 Generalization

ADRPO is a universal plug-in that generalizes across different model types and modalities:

🌊 Flow/Diffusion Models (SD3) · 🧠 Text LLMs (GRPO) · 🎵 Audio LLMs

In LLM fine-tuning, ADRPO shows an emergent ability to escape local optima through active exploration. In multimodal audio reasoning, it outperforms GRPO through superior step-by-step reasoning.
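To make the plug-in claim concrete, the sketch below shows how the same per-sample weight could sit on top of GRPO-style group-relative advantages (rewards standardized within each prompt's group of sampled completions). The wiring is an assumption for illustration and reuses the hypothetical adaptive_kl_coeff from the earlier sketch.

import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: [num_prompts, group_size] rewards for sampled completions.
    GRPO-style advantage: standardize each reward within its prompt group."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True) + 1e-8
    return (rewards - mean) / std

# Hypothetical usage with dummy rewards: 4 prompts, 8 completions each.
rewards = torch.randn(4, 8)
adv = group_relative_advantages(rewards).flatten()
# beta = adaptive_kl_coeff(adv)   # per-completion KL strength from the earlier sketch

Everything model-specific (flow, diffusion, text, or audio LLM) stays in how rewards and log-probabilities are computed; the adaptive weighting itself is model-agnostic.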

📊 Results

  • 2B: a fine-tuned 2B SD3 model surpasses 4.8B & 12B models
  • >DPO: beats offline DPO in alignment & diversity
  • >ORW: beats ORW-CFM-W2 (fixed regularization)
  • 0 extra networks needed (plug-and-play)

Evaluated on text-to-image generation tasks: attribute binding, semantic consistency, artistic style transfer, and compositional control.

📅 Publication Journey

May 2025
Submitted to NeurIPS 2025
Submitted to The Thirty-ninth Annual Conference on Neural Information Processing Systems.
Sep 2025
✅ Accepted at NeurIPS 2025 (Poster)
Decision available on OpenReview.
Oct 2025
arXiv preprint released (arXiv:2510.18053)
Dec 2025
Presented at NeurIPS 2025 · San Diego, CA

📖 BibTeX

@inproceedings{fan2025adaptive,
  title={Adaptive Divergence Regularized Policy Optimization
         for Fine-tuning Generative Models},
  author={Jiajun Fan and Tong Wei and Chaoran Cheng
          and Yuxin Chen and Ge Liu},
  booktitle={The Thirty-ninth Annual Conference on
             Neural Information Processing Systems},
  year={2025},
  url={https://openreview.net/forum?id=aXO0xg0ttW}
}