TL;DR — Existing RLHF methods for flow matching (e.g., ORW-CFM-W2) rely on outcome rewards alone, so they face a credit assignment problem across the denoising trajectory. AC-Flow introduces a full actor-critic framework with intermediate feedback (reward shaping, a dual-stability mechanism, and generalized critic weighting), achieving SOTA text-to-image alignment on SD3 without degrading diversity or stability.
🔧 Three Key Innovations
1. Reward Shaping: provides well-normalized learning signals for stable intermediate value learning and gradient control, enabling the critic to reason about multi-step trajectories.
2. Dual-Stability Mechanism: combines advantage clipping (prevents destructive policy updates) with a critic warm-up phase (lets the critic mature before guiding the actor).
3. Generalized Critic Weighting: extends reward-weighted methods while preserving model diversity via Wasserstein regularization, and recovers ORW-CFM-W2 as a special case.
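The dual-stability idea can be sketched in a few lines. This is an illustrative assumption, not the paper's implementation: the function name, warm-up length, and clip range are all hypothetical placeholders.

```python
import numpy as np

def shaped_advantage(advantage, step, warmup_steps=1000, clip_range=2.0):
    """Hypothetical dual-stability gate combining the two ingredients:
    advantage clipping plus a critic warm-up phase."""
    # Normalize the advantage batch to zero mean, unit variance.
    adv = (advantage - advantage.mean()) / (advantage.std() + 1e-8)
    # Advantage clipping: bound the magnitude of any single policy update.
    adv = np.clip(adv, -clip_range, clip_range)
    # Critic warm-up: before the critic has matured, fall back to a
    # neutral weight of 1.0 so it cannot misguide the actor.
    if step < warmup_steps:
        return np.ones_like(adv)
    return adv
```

During warm-up the actor trains as if every sample were equally weighted; afterwards the clipped, normalized advantage takes over, which bounds how far one bad critic estimate can push the policy.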
🧩 Why Intermediate Rewards Matter
Standard RLHF for flow matching gives feedback only at the end of the denoising chain — a sparse signal over many steps. AC-Flow instead assigns intermediate rewards at each timestep via an actor-critic formulation:
- Denser gradients → more stable learning, fewer catastrophic updates
- Critic as a value estimator → predicts the expected future reward at each denoising step
The result: fine-tuning that generalizes beyond direct reward optimization, with improvements that transfer to unseen LLM and audio tasks, not just image generation.
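One standard way to densify a single outcome reward, shown here as a hedged sketch rather than AC-Flow's exact formulation, is potential-based shaping with the critic's value estimates as the potential: each step receives r_t = γ·V(s_{t+1}) − V(s_t), and the true outcome reward is delivered only at the final step.

```python
def dense_rewards(values, outcome_reward, gamma=1.0):
    """Turn one sparse outcome reward into per-step signals.

    values: critic value estimates V(s_0), ..., V(s_T) along one
    denoising trajectory (illustrative; the critic itself is assumed).
    """
    T = len(values) - 1
    rewards = []
    for t in range(T):
        # Sparse reward: zero everywhere except the final denoising step.
        sparse = outcome_reward if t == T - 1 else 0.0
        # Potential-based shaping term from the critic's estimates.
        shaping = gamma * values[t + 1] - values[t]
        rewards.append(sparse + shaping)
    return rewards
```

With gamma = 1 the shaping terms telescope, so the summed return equals the outcome reward plus a constant offset V(s_T) − V(s_0): the optimal policy is unchanged, but every step now carries a gradient signal.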
🔄 AC-Flow vs ORW-CFM-W2
| | ORW-CFM-W2 (ICLR 2025) | AC-Flow (This Work) |
|---|---|---|
| Reward signal | Outcome reward only | Intermediate feedback + actor-critic |
| Value learning | No intermediate value learning | Stable value learning via reward shaping |
| Stability | Credit assignment challenge | Dual-stability prevents collapse |
| Contribution | First online RLHF for flow matching | SOTA on SD3 with even less data |
AC-Flow generalizes ORW-CFM-W2: its critic weighting scheme subsumes reward-weighted methods as a special case.
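The subsumption claim can be illustrated with a toy exponential weighting function; the name, signature, and temperature are assumptions for illustration, not the paper's API. With a critic, samples are weighted by their advantage; dropping the critic reduces it to plain reward weighting in the style of ORW-CFM-W2.

```python
import math

def sample_weight(reward, value=None, beta=1.0):
    """Illustrative exponential weighting of flow-matching samples.

    value: a critic's value estimate for this sample's state. When it
    is None, the weight degenerates to pure reward weighting, i.e. the
    reward-weighted special case.
    """
    # Critic available: weight by the advantage (reward - value).
    # No critic: weight by the raw reward alone.
    signal = reward if value is None else reward - value
    return math.exp(signal / beta)
```

A sample whose reward merely matches the critic's expectation gets a neutral weight of 1.0, while under reward-only weighting the same sample could be heavily up-weighted, which is one intuition for why advantage-based weighting gives better credit assignment.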
📊 Key Claims
- ⚙️ Actor-Critic: step-level rewards via intermediate feedback
- 🛡️ Collapse-Free: Wasserstein regularization + dual-stability mechanism
- 🎯 SD3 Fine-Tuning: stable fine-tuning without mode collapse
🔗 Related Work in This Series
AC-Flow builds on the online RLHF series for generative models:
AC-Flow (This work) — Actor-critic with intermediate step-level rewards
📖 Cite This Paper
@misc{fan2025finetuningflowmatchinggenerative,
  title={Fine-tuning Flow Matching Generative Models with Intermediate Feedback},
  author={Jiajun Fan and Chaoran Cheng and Shuaike Shen and Xiangxin Zhou and Ge Liu},
  year={2025},
  eprint={2510.18072},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2510.18072}
}