✅ ICLR 2025

Online Reward-Weighted Fine-Tuning of Flow Matching with Wasserstein Regularization

ORW-CFM-W2: First online RLHF framework for flow matching models — no human data, no collapse

1University of Illinois Urbana-Champaign    2CMU
TL;DR — RL fine-tuning of diffusion models is well studied, but flow matching models (Stable Diffusion 3, Flux) remain underexplored due to three challenges: no tractable likelihood, risk of policy collapse, and high compute cost. ORW-CFM-W2 addresses all three: online reward weighting for alignment, Wasserstein-2 regularization for diversity, and a tractable bound that keeps training efficient.
ORW-CFM-W2 Method Architecture
Figure 1. ORW-CFM-W2 architecture. Online reward-weighted fine-tuning guides the flow matching model toward high-reward regions, while Wasserstein-2 regularization (W2) prevents policy collapse and maintains generative diversity throughout training.

🚧 Why Flow Matching Is Hard to Fine-Tune

No Likelihood

Unlike diffusion models, flow matching has no tractable likelihood — standard KL regularization cannot be computed directly.

Policy Collapse

Without diversity constraints, reward maximization causes the model to collapse onto a few high-reward modes, losing generative quality.

No Human Data

Existing fine-tuning methods require expensive human-curated datasets or preference annotations. ORW-CFM-W2 needs neither.

⚙️ Three Key Contributions

1. Online Reward-Weighted Mechanism

Integrates RL into flow matching via reward weighting — guides the model to prioritize high-reward regions in data space, without requiring reward gradients or filtered datasets.
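The reward-weighting idea can be sketched in a few lines. This is a minimal illustration with hypothetical names (`reward_weights`, `weighted_cfm_loss`); exponential weighting with a temperature is one common choice and the paper's exact scheme may differ, but the key property is the same: only reward *values* are needed, never reward gradients.

```python
import math

def reward_weights(rewards, temperature=1.0):
    """Exponential reward weighting with softmax-style normalization.

    Samples drawn online from the current model get weights that grow
    with their reward, so high-reward regions dominate the loss.
    No gradient of the reward function is required.
    """
    scaled = [r / temperature for r in rewards]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_cfm_loss(per_sample_losses, rewards, temperature=1.0):
    """Reward-weighted average of per-sample conditional flow matching losses."""
    w = reward_weights(rewards, temperature)
    return sum(wi * li for wi, li in zip(w, per_sample_losses))
```

With equal rewards this reduces to the ordinary uniform average of the per-sample losses; lowering the temperature concentrates the weight on the highest-reward samples.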

2. Wasserstein-2 Regularization

Derives a tractable upper bound for W2 distance in flow matching models — prevents policy collapse and maintains generation diversity throughout continuous optimization.
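One way such a bound can arise (a sketch under our own assumptions, not the paper's exact statement): for two flows driven by velocity fields $v_\theta$ and $v_{\mathrm{ref}}$ started from the same prior, the W2 distance between their endpoint distributions is controlled by the expected squared velocity difference along the trajectory, which is directly computable from flow matching training quantities:

```latex
% Sketch of a velocity-difference bound; not the paper's exact statement.
W_2^2\big(p_1^{\theta},\, p_1^{\mathrm{ref}}\big)
\;\lesssim\;
\int_0^1 \mathbb{E}_{x_t}\!\left[\,\big\| v_\theta(x_t, t) - v_{\mathrm{ref}}(x_t, t) \big\|^2 \right] dt .
```

Penalizing the right-hand side keeps the fine-tuned flow close to the reference flow in W2, which is what prevents collapse onto a few high-reward modes.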

3. Unified RL Perspective

Establishes a connection between flow matching fine-tuning and traditional KL-regularized RL, enabling controllable reward-diversity trade-offs and deeper understanding of the learning behavior.
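Concretely, the KL-regularized RL objective referred to here has the standard textbook form, where β plays a role analogous to the W2 strength α:

```latex
\max_{\pi}\;\mathbb{E}_{x\sim\pi}\big[r(x)\big]\;-\;\beta\,D_{\mathrm{KL}}\!\big(\pi \,\big\|\, \pi_{\mathrm{ref}}\big),
\qquad
\pi^{*}(x)\;\propto\;\pi_{\mathrm{ref}}(x)\,e^{\,r(x)/\beta}.
```

The closed-form optimum shows the trade-off explicitly: small β chases reward, large β stays near the reference model.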

📊 Experiments

SD3: successfully fine-tunes Stable Diffusion 3
0: human-collected data required
SOTA: on spatial understanding & compositional tasks
Orders of magnitude less data than baselines

Tasks: target image generation, image compression, text-image alignment. Consistently achieves optimal policy convergence while allowing controllable diversity-reward trade-offs.

🎨 Visual Results on SD3

Positional understanding results
Figure 2. Spatial/positional understanding. ORW-CFM-W2 enables SD3 to correctly handle complex positional prompts that the base model fails on.
Multi-baseline comparison
Figure 3. Comparison with multiple baselines on target image generation.
Multi-reward optimization
Figure 4. Multi-reward optimization across diverse prompts and reward signals.
Semantic detail enhancement
Figure 5. Semantic detail enhancement. Fine-tuned SD3 generates images with richer detail and better semantic correspondence.

🔗 Unified RL Framework

RL framework for flow matching
Figure 6. Unified RL perspective on flow matching fine-tuning. We establish a formal connection between flow matching optimization and KL-regularized reinforcement learning, providing theoretical grounding for reward-diversity trade-offs.

📉 Quantitative Analysis

Controlled experiments on MNIST and CIFAR demonstrate the reward-diversity trade-off controlled by the W2 regularization strength α:

MNIST reward curves
Figure 7. MNIST reward curves across different α values — higher α preserves more diversity.
CIFAR reward tradeoff
Figure 8. CIFAR reward vs. W2 distance Pareto front — smooth, controllable trade-off.
CIFAR W2 tradeoff
Figure 9. W2 distance evolution during training — regularization effectively prevents mode collapse.
CIFAR text reward
Figure 10. Text-conditional CIFAR: reward improvement with controlled diversity preservation.

🎛️ Alpha Control: Reward–Diversity Knob

The regularization strength α provides a smooth knob between pure reward maximization (α=0) and full diversity preservation (α=1):
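A tiny sketch of how such a knob could combine the two loss terms (hypothetical `total_loss`; the paper's actual objective may weight the terms differently):

```python
def total_loss(reward_weighted_loss, w2_penalty, alpha):
    """Convex combination between pure reward optimization (alpha=0)
    and full diversity preservation via the W2 penalty (alpha=1).

    Hypothetical form for illustration; the paper may combine the
    terms with a different parameterization.
    """
    assert 0.0 <= alpha <= 1.0
    return (1.0 - alpha) * reward_weighted_loss + alpha * w2_penalty
```

Sweeping alpha from 0 to 1 traces out the reward-diversity Pareto front shown in the CIFAR experiments above.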

α = 0: pure reward maximization (mode collapse)
α = 0.3: balanced (recommended)
α = 0.5: balanced fine-tuning
α = 0.8: high diversity preserved
α = 1.0: full diversity preservation
SD3 Base: before fine-tuning

📅 Publication Journey

Oct 2024
Submitted to ICLR 2025
Submitted to The Thirteenth International Conference on Learning Representations.
Feb 2025
✅ Accepted at ICLR 2025 (Poster)
Accepted as a poster (OpenReview).
Apr 2025
Presented at ICLR 2025 · Singapore

📖 BibTeX

@inproceedings{fan2025online,
  title={Online Reward-Weighted Fine-Tuning of Flow Matching
         with Wasserstein Regularization},
  author={Jiajun Fan and Shuaike Shen and Chaoran Cheng
          and Yuxin Chen and Chumeng Liang and Ge Liu},
  booktitle={The Thirteenth International Conference on
             Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=2IoFFexvuw}
}