✅ ICLR 2025

Online Reward-Weighted Fine-Tuning of Flow Matching with Wasserstein Regularization

ORW-CFM-W2: First online RLHF framework for flow matching models — no human data, no collapse

1University of Illinois Urbana-Champaign    2CMU
TL;DR — RL fine-tuning of diffusion models is well studied, but flow matching models (Stable Diffusion 3, Flux) remain underexplored due to three challenges: no tractable likelihood, risk of policy collapse, and high compute cost. ORW-CFM-W2 addresses all three: online reward weighting for alignment, Wasserstein-2 regularization for diversity, and a tractable bound that keeps training efficient.
ORW-CFM-W2 Method Architecture
Figure 1. ORW-CFM-W2 architecture. Online reward-weighted fine-tuning guides the flow matching model toward high-reward regions, while Wasserstein-2 regularization (W2) prevents policy collapse and maintains generative diversity throughout training.

🚧 Why Flow Matching Is Hard to Fine-Tune

No Likelihood

Unlike diffusion models, flow matching has no tractable likelihood — standard KL regularization cannot be computed directly.

Policy Collapse

Without diversity constraints, reward maximization causes the model to collapse onto a few high-reward modes, losing generative quality.

No Human Data

Existing fine-tuning methods require expensive human-curated datasets or preference annotations. ORW-CFM-W2 needs neither.

⚙️ Three Key Contributions

1. Online Reward-Weighted Mechanism

Integrates RL into flow matching via reward weighting — guides the model to prioritize high-reward regions in data space, without requiring reward gradients or filtered datasets.
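The reward-weighting idea can be sketched in a few lines. This is a minimal illustration with hypothetical names (`reward_weights`, `weighted_cfm_loss`); exponential weighting with a temperature is one common choice and the paper's exact scheme may differ, but the key property is the same: only reward *values* are needed, never reward gradients.

```python
import math

def reward_weights(rewards, temperature=1.0):
    """Exponential reward weighting with softmax-style normalization.

    Samples drawn online from the current model get weights that grow
    with their reward, so high-reward regions dominate the loss.
    No gradient of the reward function is required.
    """
    scaled = [r / temperature for r in rewards]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_cfm_loss(per_sample_losses, rewards, temperature=1.0):
    """Reward-weighted average of per-sample conditional flow matching losses."""
    w = reward_weights(rewards, temperature)
    return sum(wi * li for wi, li in zip(w, per_sample_losses))
```

With equal rewards this reduces to the ordinary uniform average of the per-sample losses; lowering the temperature concentrates the weight on the highest-reward samples.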

2. Wasserstein-2 Regularization

Derives a tractable upper bound for W2 distance in flow matching models — prevents policy collapse and maintains generation diversity throughout continuous optimization.
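One way such a bound can arise (a sketch under our own assumptions, not the paper's exact statement): for two flows driven by velocity fields $v_\theta$ and $v_{\mathrm{ref}}$ started from the same prior, the W2 distance between their endpoint distributions is controlled by the expected squared velocity difference along the trajectory, which is directly computable from flow matching training quantities:

```latex
% Sketch of a velocity-difference bound; not the paper's exact statement.
W_2^2\big(p_1^{\theta},\, p_1^{\mathrm{ref}}\big)
\;\lesssim\;
\int_0^1 \mathbb{E}_{x_t}\!\left[\,\big\| v_\theta(x_t, t) - v_{\mathrm{ref}}(x_t, t) \big\|^2 \right] dt .
```

Penalizing the right-hand side keeps the fine-tuned flow close to the reference flow in W2, which is what prevents collapse onto a few high-reward modes.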

3. Unified RL Perspective

Establishes a connection between flow matching fine-tuning and traditional KL-regularized RL, enabling controllable reward-diversity trade-offs and deeper understanding of the learning behavior.
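Concretely, the KL-regularized RL objective referred to here has the standard textbook form, where β plays a role analogous to the W2 strength α:

```latex
\max_{\pi}\;\mathbb{E}_{x\sim\pi}\big[r(x)\big]\;-\;\beta\,D_{\mathrm{KL}}\!\big(\pi \,\big\|\, \pi_{\mathrm{ref}}\big),
\qquad
\pi^{*}(x)\;\propto\;\pi_{\mathrm{ref}}(x)\,e^{\,r(x)/\beta}.
```

The closed-form optimum shows the trade-off explicitly: small β chases reward, large β stays near the reference model.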

📊 Experiments

SD3: successfully fine-tunes Stable Diffusion 3
0: human-collected data required
SOTA: on spatial understanding & compositional tasks
Orders of magnitude less data than baselines

Tasks: target image generation, image compression, text-image alignment. Consistently achieves optimal policy convergence while allowing controllable diversity-reward trade-offs.

🎨 Visual Results on SD3

Positional understanding results
Figure 2. Spatial/positional understanding. ORW-CFM-W2 enables SD3 to correctly handle complex positional prompts that the base model fails on.
Multi-baseline comparison
Figure 3. Comparison with multiple baselines on target image generation.
Multi-reward optimization
Figure 4. Multi-reward optimization across diverse prompts and reward signals.
Semantic detail enhancement
Figure 5. Semantic detail enhancement. Fine-tuned SD3 generates images with richer detail and better semantic correspondence.

🔗 Unified RL Framework

RL framework for flow matching
Figure 6. Unified RL perspective on flow matching fine-tuning. We establish a formal connection between flow matching optimization and KL-regularized reinforcement learning, providing theoretical grounding for reward-diversity trade-offs.

📉 Quantitative Analysis

Controlled experiments on MNIST and CIFAR demonstrate the reward-diversity trade-off controlled by the W2 regularization strength α:

MNIST reward curves
Figure 7. MNIST reward curves across different α values — higher α preserves more diversity.
CIFAR reward tradeoff
Figure 8. CIFAR reward vs. W2 distance Pareto front — smooth, controllable trade-off.
CIFAR W2 tradeoff
Figure 9. W2 distance evolution during training — regularization effectively prevents mode collapse.
CIFAR text reward
Figure 10. Text-conditional CIFAR: reward improvement with controlled diversity preservation.

🎛️ Alpha Control: Reward–Diversity Knob

The regularization strength α provides a smooth knob between pure reward maximization (α=0) and full diversity preservation (α=1):
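A tiny sketch of how such a knob could combine the two loss terms (hypothetical `total_loss`; the paper's actual objective may weight the terms differently):

```python
def total_loss(reward_weighted_loss, w2_penalty, alpha):
    """Convex combination between pure reward optimization (alpha=0)
    and full diversity preservation via the W2 penalty (alpha=1).

    Hypothetical form for illustration; the paper may combine the
    terms with a different parameterization.
    """
    assert 0.0 <= alpha <= 1.0
    return (1.0 - alpha) * reward_weighted_loss + alpha * w2_penalty
```

Sweeping alpha from 0 to 1 traces out the reward-diversity Pareto front shown in the CIFAR experiments above.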

α = 0: pure reward maximization (mode collapse)
α = 0.3: balanced (recommended)
α = 0.5: balanced fine-tuning
α = 0.8: high diversity preserved
α = 1.0: full diversity preservation
SD3 Base: before fine-tuning

📅 Publication Journey

Oct 2024
Submitted to ICLR 2025
Submitted to The Thirteenth International Conference on Learning Representations.
Feb 2025
✅ Accepted at ICLR 2025 (Poster)
Accepted as a poster (OpenReview).
Apr 2025
Presented at ICLR 2025 · Singapore

📖 BibTeX

@inproceedings{fan2025online,
  title={Online Reward-Weighted Fine-Tuning of Flow Matching
         with Wasserstein Regularization},
  author={Jiajun Fan and Shuaike Shen and Chaoran Cheng
          and Yuxin Chen and Chumeng Liang and Ge Liu},
  booktitle={The Thirteenth International Conference on
             Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=2IoFFexvuw}
}