TL;DR — Existing RLHF methods for flow matching (e.g., ORW-CFM-W2) rely on outcome rewards alone, so they face a credit assignment problem across the denoising trajectory. AC-Flow introduces a full actor-critic framework with intermediate feedback (reward shaping, a dual-stability mechanism, and generalized critic weighting), achieving SOTA text-to-image alignment on SD3 without degrading diversity or stability.
🔧 Three Key Innovations
1. Reward Shaping: provides well-normalized learning signals for stable intermediate value learning and gradient control, enabling the critic to reason about multi-step trajectories.
2. Dual-Stability Mechanism: combines advantage clipping (prevents destructive policy updates) with a critic warm-up phase (lets the critic mature before guiding the actor).
3. Generalized Critic Weighting: extends reward-weighted methods while preserving model diversity via Wasserstein regularization, and recovers ORW-CFM-W2 as a special case.
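The dual-stability idea can be sketched in a few lines. This is an illustrative assumption, not the paper's implementation: the function name, warm-up length, and clip range are all hypothetical placeholders.

```python
import numpy as np

def shaped_advantage(advantage, step, warmup_steps=1000, clip_range=2.0):
    """Hypothetical dual-stability gate combining the two ingredients:
    advantage clipping plus a critic warm-up phase."""
    # Normalize the advantage batch to zero mean, unit variance.
    adv = (advantage - advantage.mean()) / (advantage.std() + 1e-8)
    # Advantage clipping: bound the magnitude of any single policy update.
    adv = np.clip(adv, -clip_range, clip_range)
    # Critic warm-up: before the critic has matured, fall back to a
    # neutral weight of 1.0 so it cannot misguide the actor.
    if step < warmup_steps:
        return np.ones_like(adv)
    return adv
```

During warm-up the actor trains as if every sample were equally weighted; afterwards the clipped, normalized advantage takes over, which bounds how far one bad critic estimate can push the policy.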
🧩 Why Intermediate Rewards Matter
Standard RLHF for flow matching gives feedback only at the end of the denoising chain — a sparse signal over many steps. AC-Flow instead assigns intermediate rewards at each timestep via an actor-critic formulation:
- Denser gradients → more stable learning, fewer catastrophic updates
- Critic as a value estimator → predicts the expected future reward at each denoising step
The result: fine-tuning that generalizes beyond direct reward optimization, with improvements that transfer to unseen LLM and audio tasks, not just image generation.
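One standard way to densify a single outcome reward, shown here as a hedged sketch rather than AC-Flow's exact formulation, is potential-based shaping with the critic's value estimates as the potential: each step receives r_t = γ·V(s_{t+1}) − V(s_t), and the true outcome reward is delivered only at the final step.

```python
def dense_rewards(values, outcome_reward, gamma=1.0):
    """Turn one sparse outcome reward into per-step signals.

    values: critic value estimates V(s_0), ..., V(s_T) along one
    denoising trajectory (illustrative; the critic itself is assumed).
    """
    T = len(values) - 1
    rewards = []
    for t in range(T):
        # Sparse reward: zero everywhere except the final denoising step.
        sparse = outcome_reward if t == T - 1 else 0.0
        # Potential-based shaping term from the critic's estimates.
        shaping = gamma * values[t + 1] - values[t]
        rewards.append(sparse + shaping)
    return rewards
```

With gamma = 1 the shaping terms telescope, so the summed return equals the outcome reward plus a constant offset V(s_T) − V(s_0): the optimal policy is unchanged, but every step now carries a gradient signal.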
🔄 AC-Flow vs ORW-CFM-W2
| | ORW-CFM-W2 (ICLR 2025) | AC-Flow (This Work) |
|---|---|---|
| Reward signal | Outcome reward only | Intermediate feedback + actor-critic |
| Value learning | No intermediate value learning | Stable value learning via reward shaping |
| Stability | Credit assignment challenge | Dual-stability prevents collapse |
| Contribution | First online RLHF for flow matching | SOTA on SD3 with even less data |
AC-Flow generalizes ORW-CFM-W2: its critic weighting scheme subsumes reward-weighted methods as a special case.
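The subsumption claim can be illustrated with a toy exponential weighting function; the name, signature, and temperature are assumptions for illustration, not the paper's API. With a critic, samples are weighted by their advantage; dropping the critic reduces it to plain reward weighting in the style of ORW-CFM-W2.

```python
import math

def sample_weight(reward, value=None, beta=1.0):
    """Illustrative exponential weighting of flow-matching samples.

    value: a critic's value estimate for this sample's state. When it
    is None, the weight degenerates to pure reward weighting, i.e. the
    reward-weighted special case.
    """
    # Critic available: weight by the advantage (reward - value).
    # No critic: weight by the raw reward alone.
    signal = reward if value is None else reward - value
    return math.exp(signal / beta)
```

A sample whose reward merely matches the critic's expectation gets a neutral weight of 1.0, while under reward-only weighting the same sample could be heavily up-weighted, which is one intuition for why advantage-based weighting gives better credit assignment.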
📊 Key Claims
- ⚙️ Actor-Critic: step-level rewards via intermediate feedback
- 🛡️ Collapse-Free: Wasserstein regularization + dual-stability mechanism
- 🎯 SD3 Fine-Tuning: stable fine-tuning without mode collapse
🔗 Related Work in This Series
AC-Flow builds on the online RLHF series for generative models:
AC-Flow (This work) — Actor-critic with intermediate step-level rewards
📖 Cite This Paper
@misc{fan2025finetuningflowmatchinggenerative,
  title={Fine-tuning Flow Matching Generative Models with Intermediate Feedback},
  author={Jiajun Fan and Chaoran Cheng and Shuaike Shen and Xiangxin Zhou and Ge Liu},
  year={2025},
  eprint={2510.18072},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2510.18072}
}