ORW-CFM-W2: first online RL fine-tuning framework for flow matching models (no human data, no collapse)
Unlike diffusion models, flow matching has no tractable likelihood, so standard KL regularization cannot be computed directly.
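For context (standard continuous normalizing flow identities, not the paper's notation): the exact log-likelihood of a flow model requires integrating the divergence of its velocity field along the sampling ODE, so a direct KL term would need this integral for both the fine-tuned and the reference model on every sample:

\frac{dx_t}{dt} = v_\theta(x_t, t), \qquad
\log p_1^{\theta}(x_1) = \log p_0(x_0) - \int_0^1 \nabla \cdot v_\theta(x_t, t)\, dt,

\mathrm{KL}\big(p_1^{\theta} \,\|\, p_1^{\mathrm{ref}}\big)
= \mathbb{E}_{x_1 \sim p_1^{\theta}}\big[\log p_1^{\theta}(x_1) - \log p_1^{\mathrm{ref}}(x_1)\big].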
Without diversity constraints, reward maximization causes the model to collapse onto a few high-reward modes, losing generative quality.
Existing fine-tuning methods require expensive human-curated datasets or preference annotations. ORW-CFM-W2 needs neither.
Integrates RL into flow matching via reward weighting: guides the model to prioritize high-reward regions in data space without requiring reward gradients or filtered datasets (see the sketch below).
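A minimal PyTorch-style sketch of the reward-weighted idea, under stated assumptions: samples are drawn online from the current model, scored by a black-box reward, and reused as targets in a standard conditional flow matching loss whose per-sample terms are weighted by a transform of the reward (here a softmax of r/beta, so no reward gradients are needed). The velocity_model signature, the linear interpolation path, and the exact weighting are illustrative and may differ from the paper.

import torch

def reward_weighted_cfm_loss(velocity_model, samples, rewards, beta=1.0):
    """One reward-weighted conditional flow matching step (illustrative sketch).

    samples: x1 drawn online from the current model, shape (B, ...).
    rewards: black-box scalar rewards r(x1), shape (B,); only used as weights.
    """
    b = samples.shape[0]
    x1 = samples
    x0 = torch.randn_like(x1)                          # noise endpoint of the path
    t = torch.rand(b, *([1] * (x1.dim() - 1)), device=x1.device)
    xt = (1 - t) * x0 + t * x1                         # linear interpolation path
    target_v = x1 - x0                                 # conditional target velocity
    pred_v = velocity_model(xt, t.flatten())           # assumed signature v_theta(x, t)
    w = torch.softmax(rewards / beta, dim=0) * b       # weights proportional to exp(r/beta), mean ~ 1
    per_sample = ((pred_v - target_v) ** 2).flatten(1).mean(dim=1)
    return (w.detach() * per_sample).mean()            # reward treated as a black box: no gradient through w

In an online loop, samples would be regenerated from the current model between updates, and a diversity regularizer (see the W2 bound below) would be added to this loss.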
Derives a tractable upper bound on the Wasserstein-2 (W2) distance for flow matching models; used as a regularizer, it prevents policy collapse and maintains generation diversity throughout continued online optimization.
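A hedged sketch of why such a bound can be tractable (this is the generic Gronwall-type comparison of two ODE flows started from the same noise distribution; the paper's exact bound and constants may differ): the W2 distance between the two models' outputs is controlled by the discrepancy of their velocity fields along one flow's trajectories, which can be estimated from samples without any likelihoods:

W_2\big(p_1^{\theta},\, p_1^{\mathrm{ref}}\big) \;\le\; e^{L}
\int_0^1 \Big( \mathbb{E}_{x_t \sim p_t^{\mathrm{ref}}}
\big\| v_\theta(x_t, t) - v_{\mathrm{ref}}(x_t, t) \big\|^2 \Big)^{1/2} dt,

where L is a Lipschitz bound on v_\theta in x.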
Establishes a connection between flow matching fine-tuning and traditional KL-regularized RL, enabling controllable reward-diversity trade-offs and a deeper understanding of the learning behavior.
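The KL-regularized RL connection refers to the standard result that the objective below is maximized by an exponentially reward-tilted reference distribution, which is exactly what reward weighting approximates; beta controls the reward-diversity trade-off:

\max_{p}\; \mathbb{E}_{x \sim p}[r(x)] - \beta\, \mathrm{KL}\big(p \,\|\, p_{\mathrm{ref}}\big)
\quad\Longrightarrow\quad
p^{*}(x) \;\propto\; p_{\mathrm{ref}}(x)\, \exp\!\big(r(x)/\beta\big).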
Tasks: target image generation, image compression, and text-image alignment. Consistently converges to the optimal policy while allowing controllable reward-diversity trade-offs.
@inproceedings{fan2025online,
  title={Online Reward-Weighted Fine-Tuning of Flow Matching with Wasserstein Regularization},
  author={Jiajun Fan and Shuaike Shen and Chaoran Cheng and Yuxin Chen and Chumeng Liang and Ge Liu},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=2IoFFexvuw}
}