ADRPO: Sample-level adaptive KL control, no manual tuning required
Existing RLHF methods apply the same divergence regularization to every sample. ADRPO observes that high-advantage samples (clearly better than baseline) deserve less regularization so their quality can be fully exploited, while low-advantage samples need stronger regularization to prevent harmful policy updates. This simple principle yields a significantly better exploration-exploitation trade-off.
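To make the principle concrete, here is a minimal Python sketch of a per-sample adaptive KL penalty. The function name, the inverse-advantage schedule, and the base_beta parameter are illustrative assumptions, not ADRPO's exact formulation; it only shows the direction of the adaptation (higher advantage, weaker regularization).

```python
import torch

def adaptive_kl_loss(logp_new, logp_ref, advantages, base_beta=0.1, eps=1e-6):
    """Illustrative per-sample adaptive KL penalty (not the paper's exact rule).

    High-advantage samples get a smaller KL coefficient (weaker regularization,
    more exploitation); low-advantage samples get a larger one (stronger
    regularization, safer updates).
    """
    # Simple per-sample KL surrogate: log-ratio between current and reference policy.
    kl = logp_new - logp_ref

    # Normalize advantages to [0, 1] within the batch, then map
    # high advantage -> small coefficient, low advantage -> large coefficient.
    a = (advantages - advantages.min()) / (advantages.max() - advantages.min() + eps)
    beta = base_beta * (1.0 - a) + eps  # hypothetical schedule

    # Advantage-weighted policy-gradient term plus the adaptive KL penalty.
    pg_loss = -(advantages.detach() * logp_new)
    return (pg_loss + beta.detach() * kl).mean()
```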
ADRPO is a universal plug-in that generalizes across different model types and modalities:
Flow/Diffusion Models (SD3) · Text LLMs (GRPO) · Audio LLMs
In LLM fine-tuning, ADRPO shows an emergent ability to escape local optima through active exploration. In multimodal audio reasoning, it outperforms GRPO through superior step-by-step reasoning.
Evaluated on text-to-image generation tasks: attribute binding, semantic consistency, artistic style transfer, and compositional control.
@inproceedings{fan2025adaptive,
  title     = {Adaptive Divergence Regularized Policy Optimization for Fine-tuning Generative Models},
  author    = {Jiajun Fan and Tong Wei and Chaoran Cheng and Yuxin Chen and Ge Liu},
  booktitle = {The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year      = {2025},
  url       = {https://openreview.net/forum?id=aXO0xg0ttW}
}