Jiajun Fan
🌊 RL Post-Training for Generative Models 🧠 Multimodal Reasoning LLMs 🎮 Superhuman Deep RL

I am a Computer Science Ph.D. student at UIUC. My research focuses on autonomous RL post-training for large generative models — making diffusion/flow models and multimodal reasoning LLMs continuously self-improve with progressively less human intervention. Previously, I pushed deep RL to superhuman performance, breaking 24 Atari world records and outperforming Agent57 with 500× less data.

🎓 Seeking research internship — Summer 2026.  [CV]  [Scholar]  [Email]
📰 Latest News
  • Jan 2026 · 2 papers accepted at ICLR 2026 — CESAR & SP-VLA. See you in Rio 🇧🇷
  • Sep 2025 · 2 papers accepted at NeurIPS 2025 — ADRPO & VarCon. See you in San Diego 🌊
  • Jun 2025 · Paper accepted at IEEE TPAMI: PRANCE.
  • Feb 2025 · Paper accepted at ICLR 2025: ORW-CFM-W2 (Flow Matching self-evolution).
  • Jan 2025 · Reviewer: ICLR 2025, NeurIPS 2024, CVPR 2026, AAAI 2025, AISTATS 2025.
  • Aug 2024 · 🎓 Started Ph.D. at UIUC CS (GPA 4.0/4.0).
  • Jan 2023 · Oral (top 5%): LBC at ICLR 2023, ranked 5/4176 — broke 24 Atari world records.

📄 Selected Publications

* = first/co-first author  ·  Full list on Google Scholar  /  Publications page

ICLR 2026
Incentivizing Consistent, Effective and Scalable Reasoning Capability in Audio LLMs via Reasoning Process Rewards · Project Page
CESAR resolves test-time inverse scaling in Audio LLMs by rewarding the reasoning process via GRPO, achieving SOTA on MMAU — outperforming Gemini 2.5 Pro and GPT-4o Audio.
J. Fan*, R. Ren, J. Li, R. Pandey, P.G. Shivakumar, A. Gandhe, G. Liu, Y. Gu, I. Bulyko
🏆 SOTA on MMAU Test-mini · Outperforms Gemini 2.5 Pro & GPT-4o Audio
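
A toy sketch of the core mechanism, a GRPO-style group-relative advantage that blends outcome and process rewards (the linear blend, weights, and names are my illustration, not CESAR's released code):

```python
import torch

def grpo_advantages(outcome_r, process_r, alpha=0.5):
    """Group-relative advantages over G responses to one prompt.

    outcome_r, process_r: shape (G,) rewards for answer correctness and
    reasoning quality. alpha and the linear blend are illustrative
    assumptions, not the paper's exact recipe.
    """
    r = (1 - alpha) * outcome_r + alpha * process_r   # total reward per sample
    return (r - r.mean()) / (r.std() + 1e-8)          # normalize within the group

# usage: 4 sampled responses to one audio question
outcome = torch.tensor([1.0, 0.0, 1.0, 0.0])    # final-answer correctness
process = torch.tensor([0.9, 0.2, 0.4, 0.1])    # reasoning-quality score
advantages = grpo_advantages(outcome, process)  # feed into the policy-gradient update
```

Rewarding the process term is what separates responses 1 and 3 above: both answer correctly, but the one with sound reasoning gets a much larger advantage.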
ICLR 2026
SP-VLA: A Joint Model Scheduling and Token Pruning Approach for VLA Model Acceleration
SP-VLA introduces action-aware model scheduling and spatio-semantic token pruning for VLA model acceleration, achieving 1.5× lossless speedup on LIBERO and 2.4× speedup on SimplerEnv.
Y. Li, Y. Meng, Z. Sun, K. Ji, C. Tang, J. Fan, X. Ma, S.-T. Xia, Z. Wang, W. Zhu
⚡ 1.5× lossless speedup (LIBERO) · 2.4× speedup (SimplerEnv)
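
A minimal sketch of score-based token pruning, the kind of operation SP-VLA's spatio-semantic criterion drives (the scoring below is a toy stand-in; only the keep-top-k mechanics are shown):

```python
import torch

def prune_tokens(tokens, scores, keep_ratio=0.25):
    """Keep the top-k visual tokens by importance score.

    tokens: (B, N, D) token embeddings; scores: (B, N) importance values
    (in SP-VLA these would come from spatial and semantic cues; here they
    are just an input). Returns the kept tokens, shape (B, k, D).
    """
    B, N, D = tokens.shape
    k = max(1, int(N * keep_ratio))
    idx = scores.topk(k, dim=1).indices                          # (B, k)
    return tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, D))

# usage: prune 196 ViT patch tokens down to 49
x = torch.randn(2, 196, 768)
s = x.norm(dim=-1)              # toy importance: token L2 norm
kept = prune_tokens(x, s)       # -> shape (2, 49, 768)
```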
NeurIPS 2025
Adaptive Divergence Regularized Policy Optimization for Fine-tuning Generative Models · Project Page
ADRPO introduces sample-level adaptive divergence regularization for RLHF — high-value samples get more freedom, poor samples get stronger constraints. Plug-and-play on any RL method.
J. Fan*, T. Wei, C. Cheng, Y. Chen, G. Liu
🚀 2B SD3 surpasses 4.8B & 12B models · Generalizes to LLMs & audio reasoning
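
The one-line intuition, as a toy loss (the per-sample schedule below is an illustrative sigmoid, not ADRPO's actual formula):

```python
import torch

def adaptive_kl_loss(logp, logp_ref, advantage, beta0=0.1):
    """Policy term with a per-sample adaptive divergence penalty.

    logp / logp_ref: log-probs under the policy and a frozen reference;
    advantage: per-sample value estimate. High-advantage samples get a
    weaker constraint, low-advantage samples a stronger one. The sigmoid
    schedule and beta0 are my assumptions for illustration.
    """
    beta = beta0 * torch.sigmoid(-advantage)    # weaker penalty when advantage is high
    kl_proxy = logp - logp_ref                  # simple per-sample divergence proxy
    return (-advantage.detach() * logp + beta.detach() * kl_proxy).mean()
```

The point of making beta per-sample is that a single global KL weight either over-constrains good samples or under-constrains bad ones; adapting it removes that manual tuning knob.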
NeurIPS 2025
Variational Supervised Contrastive Learning
VarCon recasts supervised contrastive learning as variational inference, replacing pairwise comparisons with a posterior-weighted ELBO, and reaches SOTA 79.36% Top-1 accuracy on ImageNet-1K with ResNet-50.
Z. Wang, J. Fan, T. Nguyen, H. Ji, G. Liu
📊 SOTA 79.36% Top-1 on ImageNet-1K (ResNet-50)
ICLR 2025
Online Reward-Weighted Fine-Tuning of Flow Matching with Wasserstein Regularization · Project Page
ORW-CFM-W2 is the first online RLHF method for flow matching — no human data, no likelihood estimation, no collapse. Wasserstein-2 regularization maintains generation diversity.
J. Fan*, S. Shen, C. Cheng, Y. Chen, C. Liang, G. Liu
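
A minimal sketch of the objective shape, reward-weighted conditional flow matching plus a proximity term toward a frozen reference velocity field (my simplified surrogate for the paper's Wasserstein-2 regularizer):

```python
import torch

def orw_cfm_loss(v_theta, v_ref, u_target, reward, lam=0.1):
    """Reward-weighted CFM loss with a W2-style proximity surrogate.

    v_theta:  (B, D) predicted velocities at sampled (x_t, t);
    v_ref:    (B, D) velocities from a frozen reference model;
    u_target: (B, D) conditional flow-matching targets;
    reward:   (B,)  scores from a reward model. The softmax weighting and
    the ||v_theta - v_ref||^2 term are simplifications for illustration.
    """
    w = torch.softmax(reward, dim=0) * reward.numel()        # normalized sample weights
    cfm = (w * (v_theta - u_target).pow(2).sum(-1)).mean()   # reward-weighted CFM term
    reg = (v_theta - v_ref).pow(2).sum(-1).mean()            # keeps the model near the reference
    return cfm + lam * reg
```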
Preprint · 2025
Fine-tuning Flow Matching Generative Models with Intermediate Feedback · Project Page
AC-Flow introduces actor-critic with intermediate feedback for flow matching — reward shaping + dual-stability mechanism + Wasserstein regularization enables robust SD3 fine-tuning without collapse.
J. Fan*, C. Cheng, S. Shen, X. Zhou, G. Liu  ·  Under Review
TPAMI 2026
PRANCE: Joint Token-Optimization and Structural Channel-Pruning for Adaptive ViT Inference
PRANCE jointly optimizes token pruning and structural channel pruning for adaptive ViT inference, achieving significant speedup while maintaining accuracy.
Y. Li, C. Tang, Y. Meng, J. Fan, Z. Chai, X. Ma, Z. Wang, W. Zhu  ·  IEEE TPAMI
ICLR 2023 · Oral
Learnable Behavior Control: Breaking Atari Human World Records via Sample-Efficient Behavior Selection · Project Page
LBC introduces a learnable hybrid behavior mapping and bandit meta-controller for exploration control in deep RL, breaking 24 Atari human world records with 500× less data than prior SOTA.
J. Fan*, Y. Zhuang, Y. Liu, J. Hao, B. Wang, J. Zhu, H. Wang, S.-T. Xia
🏅 Ranked 5/4176 · 10,077% mean human score · 24 world records · 500× data efficiency
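
The meta-controller idea in its simplest form, a UCB bandit choosing among candidate behaviors by their running returns (LBC's actual method learns a hybrid behavior mapping; this shows only the selection loop):

```python
import math, random

def ucb_select(counts, values, c=2.0):
    """Pick a behavior index by UCB1 over per-behavior average returns."""
    t = sum(counts) + 1
    def score(i):
        if counts[i] == 0:
            return float("inf")   # try every behavior at least once
        return values[i] + c * math.sqrt(math.log(t) / counts[i])
    return max(range(len(counts)), key=score)

# usage: 3 candidate behaviors (e.g., different exploration settings)
counts, values = [0, 0, 0], [0.0, 0.0, 0.0]
for _ in range(100):
    i = ucb_select(counts, values)
    ret = random.gauss([0.2, 0.5, 0.3][i], 0.1)   # stand-in episode return
    counts[i] += 1
    values[i] += (ret - values[i]) / counts[i]    # running-mean update
```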
ICML 2022
Generalized Data Distribution Iteration · Project Page
GDI shows that optimizing the training data distribution is the key lever for superhuman RL efficiency. Provides a unified framework that subsumes diverse RL algorithms as special cases.
J. Fan*, C. Xiao
📈 Agent57 beaten with 500× less data & 2× avg performance

🔬 Research Interests
🌊
RL Post-Training for Generative Models
Collapse-free online RLHF for flow/diffusion models. No human-collected preference data needed — models improve from their own generations (ORW-CFM-W2, ADRPO, AC-Flow).
🧠
Reasoning in Multimodal LLMs
Process-reward RL for audio/visual LLMs — fixing test-time inverse scaling so reasoning actually helps, not hurts (CESAR).
🎮
Superhuman-Level Deep RL
Sample-efficient RL that exceeds human performance. Broke 24 Atari world records with 500× less data than prior SOTA (LBC, GDI).
⚡ Impact at a Glance
  • Papers at top venues: ICLR · NeurIPS · ICML · TPAMI
  • 24 Atari world records broken by LBC (ICLR'23 Oral)
  • 500× more data-efficient than Agent57
  • SOTA on MMAU audio reasoning (beats Gemini 2.5 Pro)
  • GPA 4.0/4.0, UIUC CS Ph.D.
💡 Research Vision

Making AI Systems That Improve Themselves

Today's AI is frozen after training. I work to change that: AI that never stops getting better, with progressively less human scaffolding.

Step 1 — ICLR 2025
Eliminate human-collected preference data
ORW-CFM-W2: online reward-weighted training lets models improve from their own generations — no paired human data needed.
Step 2 — NeurIPS 2025
Remove manual KL tuning
ADRPO: adaptive divergence control eliminates the need for hand-tuned regularization — each sample gets its own constraint.
Step 3 — ICLR 2026
Reward the reasoning process, not just outcomes
CESAR: process-level rewards resolve test-time inverse scaling in Audio LLMs — reasoning finally helps instead of hurts, achieving SOTA on MMAU.
Step 4 — Ongoing
Fully autonomous self-improvement
The endgame: generative models that continuously improve with progressively less human intervention — from data collection to reward design to training itself.
🏅 Awards & Academic Service

🎖 Selected Awards

  • National Scholarship ×2, Top 1% — Nankai Univ.
  • Ranked 1st / 83 in major — Nankai Univ.
  • Outstanding Graduates (Top 1%) — Nankai Univ.
  • Tang Lixin Scholarship (Top 1%)
  • GPA 4.0/4.0 — UIUC Ph.D.
  • GPA 3.97/4.0, Top 1.3% — Tsinghua M.Eng.

🔍 Reviewer

  • ICLR 2024–2026
  • NeurIPS 2022–2025
  • ICML 2023–2026
  • CVPR 2026
  • AAAI 2025 · AISTATS 2025 · KDD 2024
📅 Conference Deadlines

Key AI/ML venue deadlines I track — for the full list see ccfddl.com.

📬 Contact

Happy to discuss research, internships, or collaborations. Best reached by email.
📧 jiajunf3@illinois.edu  ·  🏛 Siebel Center for CS, UIUC  ·  CV (PDF)
