✅ ICLR 2025

Online Reward-Weighted Fine-Tuning of Flow Matching with Wasserstein Regularization

ORW-CFM-W2 is the first online RLHF framework for flow matching models: no human data, no policy collapse

Jiajun Fan¹, Shuaike Shen¹, Chaoran Cheng¹, Yuxin Chen¹, Chumeng Liang², Ge Liu¹
¹University of Illinois Urbana-Champaign    ²CMU
📄 Paper (OpenReview) · 🏠 Author Homepage
TL;DR: RL fine-tuning is well studied for diffusion models, but flow matching models (Stable Diffusion 3, Flux) remain underexplored because of three challenges: no tractable likelihood, the risk of policy collapse, and high compute cost. ORW-CFM-W2 addresses all three: online reward weighting for alignment, Wasserstein-2 regularization for diversity, and a tractable bound that keeps the method efficient.

🚧 Why Flow Matching Is Hard to Fine-Tune

No Likelihood

Unlike diffusion models, flow matching has no tractable likelihood, so standard KL regularization cannot be computed directly.

Policy Collapse

Without diversity constraints, reward maximization causes the model to collapse onto a few high-reward modes, losing generative quality.

No Human Data

Existing fine-tuning methods require expensive human-curated datasets or preference annotations. ORW-CFM-W2 needs neither.

⚙️ Three Key Contributions

1. Online Reward-Weighted Mechanism

Integrates RL into flow matching via reward weighting, guiding the model to prioritize high-reward regions of the data space without requiring reward gradients or filtered datasets.
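
For intuition, here is a minimal PyTorch-style sketch of one online reward-weighted CFM update. The names (velocity_model, reward_fn), the linear interpolation path, and the exponential weighting exp(r / tau) are illustrative assumptions rather than the paper's exact formulation; the key point is that the reward only reweights the regression loss, so no reward gradients are needed.

import torch

def orw_cfm_step(velocity_model, reward_fn, x1, optimizer, tau=1.0):
    """One illustrative online reward-weighted CFM update (sketch, not the paper's exact loss).

    x1 is a batch of samples freshly generated by the current model (the "online" data).
    """
    x0 = torch.randn_like(x1)                       # noise endpoint of the probability path
    t = torch.rand(x1.shape[0], device=x1.device)   # random times in [0, 1]
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))        # broadcastable time

    xt = (1 - t_) * x0 + t_ * x1                    # linear interpolation path (assumed)
    target_v = x1 - x0                              # conditional target velocity for that path

    with torch.no_grad():                           # reward enters only as a weight
        w = torch.exp(reward_fn(x1) / tau)          # tau trades reward sharpness vs. coverage
        w = w / w.mean()                            # normalize for a stable loss scale

    pred_v = velocity_model(xt, t)
    loss = (w.view(-1, *([1] * (x1.dim() - 1))) * (pred_v - target_v) ** 2).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()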

2. Wasserstein-2 Regularization

Derives a tractable upper bound on the W2 distance for flow matching models, preventing policy collapse and maintaining generation diversity throughout continuous online optimization.
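
One common way to make such a penalty tractable is to bound it by the expected squared gap between the fine-tuned velocity field and a frozen copy of the pretrained one; the sketch below assumes that form and is only an illustration of the idea, not the paper's exact bound.

import torch

def w2_penalty(velocity_model, ref_model, xt, t):
    """Illustrative W2-style regularizer (sketch): keep the fine-tuned velocity field
    close to the frozen pretrained one, which keeps the induced distributions close
    and prevents collapse onto a few high-reward modes.

    ref_model is a frozen copy of the pretrained model (e.g. a deep copy put in eval mode).
    """
    with torch.no_grad():
        ref_v = ref_model(xt, t)                    # frozen reference velocity
    return ((velocity_model(xt, t) - ref_v) ** 2).mean()

# Hypothetical usage: total_loss = reward_weighted_cfm_loss + lam * w2_penalty(...),
# where lam controls the reward-diversity trade-off.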

3. Unified RL Perspective

Establishes a connection between flow matching fine-tuning and traditional KL-regularized RL, enabling controllable reward-diversity trade-offs and a deeper understanding of the learning behavior.
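
Schematically, in generic notation (not the paper's), the regularized fine-tuning objective behind this view reads:

    % generic regularized RL fine-tuning objective; beta sets the reward-diversity trade-off
    \max_{\theta} \; \mathbb{E}_{x \sim p_{\theta}}\!\left[ r(x) \right]
      \;-\; \beta \, D\!\left( p_{\theta}, \; p_{\mathrm{ref}} \right),
    \qquad D \in \{ D_{\mathrm{KL}},\; W_2 \}

with D_KL the classic choice and the W2 term the tractable surrogate used here; larger beta keeps the policy closer to the pretrained p_ref (more diversity), smaller beta pushes harder toward reward.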

📊 Experiments

Successfully fine-tunes Stable Diffusion 3 (SD3)
Zero human-collected data required
Handles spatial understanding & compositional tasks
SOTA results with orders of magnitude less data than baselines

Tasks: target image generation, image compression, and text-image alignment. The method consistently converges toward the optimal policy while allowing controllable diversity-reward trade-offs.

📅 Publication Journey

Oct 2024
Submitted to ICLR 2025
Submitted to The Thirteenth International Conference on Learning Representations.
Feb 2025
✅ Accepted at ICLR 2025 (Poster)
Accepted as a poster. OpenReview
Apr 2025
Presented at ICLR 2025 · Singapore

📖 BibTeX

@inproceedings{fan2025online,
  title={Online Reward-Weighted Fine-Tuning of Flow Matching
         with Wasserstein Regularization},
  author={Jiajun Fan and Shuaike Shen and Chaoran Cheng
          and Yuxin Chen and Chumeng Liang and Ge Liu},
  booktitle={The Thirteenth International Conference on
             Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=2IoFFexvuw}
}