Jiajun Fan

🌊 RL Post-Training for Generative Models · 🧠 Multimodal Reasoning LLMs · 🎮 Superhuman Deep RL · 🤖 Agentic RL

CS Ph.D. student at UIUC. I work on RL post-training for generative models — making diffusion/flow models and multimodal reasoning LLMs continuously self-improve with minimal human supervision. Previously: 24 Atari world records, 500× more data-efficient than Agent57, ICLR 2023 Oral (rank 5/4176).

🎓 Seeking research internship — Fall 2026 / 2027. RL · Generative Models · Reasoning LLMs · Agentic RL  [CV]  [Scholar]  [Email]

📰 Latest News

  • Apr 2026 · Accept · Several papers accepted at ICML 2026 (Seoul, Jul 6–11). See you in Seoul 🇰🇷
  • Apr 2026 · Finish · 🇧🇷 Presented CESAR & SP-VLA at ICLR 2026, Rio de Janeiro, Apr 23–27.
  • Jan 2026 · Accept · 2 papers at ICLR 2026: CESAR & SP-VLA. See you in Rio 🇧🇷
  • Sep 2025 · Accept · 2 papers at NeurIPS 2025: ADRPO & VarCon. See you in San Diego 🌊
  • Jun 2025 · Accept · Paper accepted at IEEE TPAMI: PRANCE.
  • Feb 2025 · Accept · Paper accepted at ICLR 2025: ORW-CFM-W2 (Flow Matching self-evolution).
  • Jan 2025 · Service · Reviewer: ICLR 2024–26, NeurIPS 2022–25, ICML 2023–26, CVPR 2026, AAAI 2025, AISTATS 2025, KDD 2024.
  • Aug 2024 · 🎓 Started Ph.D. at UIUC CS (GPA 4.0/4.0).
  • Jan 2023 · Oral · Top 5% · LBC at ICLR 2023, ranked 5/4176; broke 24 Atari world records.

📄 Selected Publications

* = first/co-first author  ·  Full list on Google Scholar  /  Publications page

ICLR 2026
[Figure: CESAR framework]
Incentivizing Consistent, Effective and Scalable Reasoning Capability in Audio LLMs via Reasoning Process Rewards · [Project Page]
J. Fan*, R. Ren, J. Li, R. Pandey, P.G. Shivakumar, I. Bulyko, A. Gandhe, G. Liu, Y. Gu
CESAR resolves test-time inverse scaling in Audio LLMs by rewarding the reasoning process via GRPO. Without proper guidance, models produce hallucinatory reasoning; process rewards correct that.
🏆 SOTA on MMAU Test-mini · Outperforms Gemini 2.5 Pro & GPT-4o Audio

ICLR 2026
SP-VLA: A Joint Model Scheduling and Token Pruning Approach for VLA Model Acceleration
Y. Li, Y. Meng, Z. Sun, K. Ji, C. Tang, J. Fan, X. Ma, S.-T. Xia, Z. Wang, W. Zhu
SP-VLA combines action-aware model scheduling with spatio-semantic token pruning to accelerate VLA models.
⚡ 1.5× lossless speedup (LIBERO) · 2.4× speedup (SimplerEnv)

NeurIPS 2025
[Figure: ADRPO qualitative results]
Adaptive Divergence Regularized Policy Optimization for Fine-tuning Generative Models · [Project Page]
J. Fan*, T. Wei, C. Cheng, Y. Chen, G. Liu
ADRPO introduces sample-level adaptive divergence regularization for RLHF: high-value samples get more freedom, poor samples get stronger constraints. It is plug-and-play on top of any RLHF method.
🚀 2B SD3 surpasses 4.8B & 12B models · Generalizes to LLMs & audio reasoning

NeurIPS 2025
Variational Supervised Contrastive Learning
Z. Wang, J. Fan, T. Nguyen, H. Ji, G. Liu
VarCon reformulates supervised contrastive learning as variational inference, with a posterior-weighted ELBO replacing pairwise comparisons.
📊 SOTA 79.36% Top-1 on ImageNet-1K with ResNet-50

ICLR 2025
[Figure: ORW-CFM-W2 method]
Online Reward-Weighted Fine-Tuning of Flow Matching with Wasserstein Regularization · [Project Page]
J. Fan*, S. Shen, C. Cheng, Y. Chen, C. Liang, G. Liu
ORW-CFM-W2 is the first online RLHF method for flow matching: no human data, no likelihood estimation, no collapse. Wasserstein (W2) regularization keeps generation diverse.
🥇 First online RLHF for flow matching · Collapse-free W2 regularization

Preprint 2025 · Under Review
Fine-tuning Flow Matching Generative Models with Intermediate Feedback · [Project Page]
J. Fan*, C. Cheng, S. Shen, X. Zhou, G. Liu
AC-Flow introduces actor-critic with intermediate feedback for flow matching: reward shaping, a dual-stability mechanism, and Wasserstein regularization together enable robust SD3 fine-tuning without collapse.
⚙️ Actor-critic with step-level reward · Stable SD3 fine-tuning without collapse

TPAMI 2026
PRANCE: Joint Token-Optimization and Structural Channel-Pruning for Adaptive ViT Inference
Y. Li, C. Tang, Y. Meng, J. Fan, Z. Chai, X. Ma, Z. Wang, W. Zhu  ·  IEEE TPAMI
PRANCE jointly optimizes token pruning and structural channel pruning for adaptive ViT inference, achieving significant speedup while maintaining accuracy.
⚡ Joint token + channel pruning · Adaptive ViT inference · IEEE TPAMI 2026

ICLR 2023 · Oral
Learnable Behavior Control: Breaking Atari Human World Records via Sample-Efficient Behavior Selection · [Project Page]
J. Fan*, Y. Zhuang, Y. Liu, J. Hao, B. Wang, J. Zhu, H. Wang, S.-T. Xia
LBC introduces a learnable hybrid behavior mapping and a bandit meta-controller as a unified framework for exploration control in deep RL, breaking 24 Atari human world records with 500× less data than prior SOTA.
🏅 Ranked 5/4176 · 10,077% mean human score · 24 world records · 500× data efficiency

ICML 2022
Generalized Data Distribution Iteration · [Project Page]
J. Fan*, C. Xiao
GDI shows that optimizing the training data distribution is the key lever for superhuman RL efficiency, and provides a unified framework that subsumes diverse RL algorithms as special cases.
📈 Agent57 beaten with 500× less data & 2× avg performance
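
As a rough illustration of the reward-weighted flow matching idea described in the ORW-CFM-W2 entry above, the sketch below scales a standard conditional flow matching regression loss by a function of the sample's reward. It is a toy, scalar version and not the paper's implementation: the exponential weight, the `temperature` parameter, the straight-line interpolation path, and all function names are assumptions made for illustration, and the Wasserstein regularizer is omitted.

```python
import math

def reward_weighted_cfm_loss(velocity_fn, x0, x1, t, reward, temperature=1.0):
    """Toy reward-weighted conditional flow matching loss for one scalar sample."""
    # Point on the straight-line path from noise x0 to generated sample x1 at time t.
    x_t = (1.0 - t) * x0 + t * x1
    # Regression target for a straight-line path: the constant velocity x1 - x0.
    target = x1 - x0
    # Assumed weighting (hypothetical): higher reward gives a larger weight.
    weight = math.exp(reward / temperature)
    return weight * (velocity_fn(x_t, t) - target) ** 2

# Hypothetical velocity models used only to exercise the loss.
zero_velocity = lambda x, t: 0.0
exact_velocity = lambda x, t: 2.0

loss_low = reward_weighted_cfm_loss(zero_velocity, 0.0, 2.0, 0.5, reward=0.0)
loss_high = reward_weighted_cfm_loss(zero_velocity, 0.0, 2.0, 0.5, reward=1.0)
loss_exact = reward_weighted_cfm_loss(exact_velocity, 0.0, 2.0, 0.5, reward=1.0)
```

With the same prediction error, the higher-reward sample contributes a larger loss, so training is pulled toward regions the reward model prefers; a model that already predicts the target velocity incurs zero loss regardless of the weight.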

🔬 Research Interests

🌊 RL Post-Training for Generative Models
Collapse-free online RLHF for flow/diffusion models. No human-collected preference data is needed; models improve from their own generations (ORW-CFM-W2, ADRPO, AC-Flow).
🧠 Reasoning in Multimodal LLMs
Process-reward RL for audio/visual LLMs: fixing test-time inverse scaling so that reasoning helps rather than hurts (CESAR).
🎮 Superhuman-Level Deep RL
Sample-efficient RL that exceeds human performance. Broke 24 Atari world records with 500× less data than prior SOTA (LBC, GDI).
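
To make "process-reward RL" a little more concrete, here is a minimal sketch of the group-relative advantage computation that GRPO-style methods use, applied to a reward that mixes an outcome term with a reasoning-process score. The mixing rule, the `process_weight` parameter, the use of the population standard deviation, and the function names are assumptions for illustration; this is not CESAR's actual reward design.

```python
from statistics import mean, pstdev

def combined_reward(outcome_correct, process_score, process_weight=0.5):
    """Hypothetical reward: answer correctness plus a weighted reasoning-quality score."""
    return float(outcome_correct) + process_weight * process_score

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]

# Two sampled responses to one prompt: (answer correct?, process score in [0, 1]).
group = [(True, 1.0), (False, 0.0)]
rewards = [combined_reward(ok, score) for ok, score in group]
advantages = group_relative_advantages(rewards)
```

Responses whose combined reward is above the group mean receive positive advantages and are reinforced; those below the mean are discouraged, without needing a separate learned value function.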

⚡ Impact at a Glance

  • Top Venue Papers: ICLR · NeurIPS · ICML · TPAMI
  • 24 Atari World Records, broken by LBC (ICLR'23 Oral)
  • 500× More Data-Efficient than Agent57
  • SOTA on MMAU Audio Reasoning; beats Gemini 2.5 Pro
  • Google Scholar Citations: see the Scholar profile
  • GPA 4.0, UIUC Ph.D., Computer Science

💡 Research Vision

Making AI Systems That Improve Themselves

Today's AI is frozen after training. I work to change that: AI that never stops getting better, with progressively less human scaffolding.

Step 1 — ICLR 2025
Eliminate human-collected preference data
ORW-CFM-W2: online reward-weighted training lets models improve from their own generations — no paired human data needed.
Step 2 — NeurIPS 2025
Remove manual KL tuning
ADRPO: adaptive divergence control eliminates the need for hand-tuned regularization — each sample gets its own constraint.
Step 3 — ICLR 2026
Reward the reasoning process, not just outcomes
CESAR: process-level rewards resolve test-time inverse scaling in Audio LLMs, so reasoning helps instead of hurting, achieving SOTA on MMAU.
Step 4 — Ongoing
Fully autonomous self-improvement
The endgame: generative models that continuously improve with progressively less human intervention — from data collection to reward design to training itself.
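
The idea in Step 2 that "each sample gets its own constraint" can be sketched as a per-sample regularization coefficient driven by how a sample's reward compares with the batch average. The linear rule below, along with `base_beta`, `min_beta`, and the function names, is an assumption chosen for simplicity and is not ADRPO's actual formula.

```python
from statistics import mean

def adaptive_betas(rewards, base_beta=1.0, min_beta=0.0):
    """Per-sample divergence weights: above-average rewards get a looser constraint."""
    baseline = mean(rewards)
    # Positive advantage lowers the weight; negative advantage raises it.
    return [max(min_beta, base_beta - (r - baseline)) for r in rewards]

def regularized_objective(rewards, divergences, base_beta=1.0):
    """Average of (-reward + beta_i * divergence_i), a quantity to minimize."""
    betas = adaptive_betas(rewards, base_beta)
    return mean(-r + b * d for r, b, d in zip(rewards, betas, divergences))

rewards = [1.0, 0.0, -1.0]
betas = adaptive_betas(rewards)
objective = regularized_objective(rewards, divergences=[0.5, 0.5, 0.5])
```

Compared with a single hand-tuned coefficient, the per-sample weights let the best sample move away from the reference model freely while the worst one is held close to it.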

🏅 Awards & Academic Service

🎖 Selected Awards

  • National Scholarship ×2, Top 1% — Nankai Univ.
  • Ranked 1st / 83 in major — Nankai Univ.
  • Outstanding Graduate (Top 1%), Nankai Univ.
  • Tang Lixin Scholarship (Top 1%)
  • GPA 4.0/4.0 — UIUC Ph.D.
  • ICLR 2023 Oral (Top 0.12%, 5/4176) — LBC paper
  • GPA 3.97/4.0, Top 1.3% — Tsinghua M.Eng.

🔍 Reviewer

  • ICLR 2024 · 2025 · 2026
  • NeurIPS 2022–2025
  • ICML 2023–2026
  • CVPR 2026
  • AAAI 2025 · AISTATS 2025 · KDD 2024

📅 Conference Deadlines

Key AI/ML venue deadlines I track — for the full list see ccfddl.com.

📬 Contact

Happy to discuss research, internships, or collaborations. Best reached by email.
📧 jiajunf3@illinois.edu  ·  🏛 Siebel Center for CS, UIUC  ·  CV  ·  💼 LinkedIn  ·  🔬 ORCID