🏆 ICLR 2026

Incentivizing Consistent, Effective and Scalable Reasoning Capability in Audio LLMs via Reasoning Process Rewards

CESAR: Consistent, Effective, and Scalable Audio Reasoners

Jiajun Fan1,2, Roger Ren2, Jingyuan Li2, Rahul Pandey2, Prashanth G. Shivakumar2, Yile Gu2, Ankur Gandhe2, Ge Liu1, Ivan Bulyko2
1University of Illinois Urbana-Champaign    2Amazon
📄 Paper (OpenReview) · arXiv · 🏠 Author Homepage
TL;DR — Audio LLMs trained only on outcome rewards produce hallucinatory, inconsistent, and unscalable reasoning chains. CESAR shifts to rewarding the reasoning process itself via online RL (GRPO), resolving test-time inverse scaling and achieving SOTA on MMAU — outperforming Gemini 2.5 Pro and GPT-4o Audio.

🔍 The Problem

Adding chain-of-thought reasoning to Audio LLMs often degrades performance — a phenomenon we term test-time inverse scaling. Longer reasoning chains yield progressively worse results. Why?

❌ Without CESAR

Without guidance on the reasoning process, models produce hallucinatory, inconsistent reasoning that accumulates errors over longer chains. Outcome-only rewards cannot fix this.

✅ With CESAR

Process rewards incentivize consistency, structured analytical patterns, causal reasoning, and calibrated depth — transforming reasoning from a liability into a genuine capability.

⚙️ Method

CESAR uses Group Relative Policy Optimization (GRPO) with a multi-faceted reward suite that goes beyond simple correctness (a sketch of how these rewards could be composed follows the list):

1. Correctness Reward: the standard outcome reward. Is the final answer right?
2. Consistency Reward: does the reasoning chain stay internally consistent?
3. Structure Reward: does the reasoning follow structured analytical patterns and causal logic?
4. Depth Calibration: is the reasoning depth calibrated to task complexity, avoiding both over- and under-thinking?
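
To make the reward suite concrete, here is a minimal, hedged sketch of how a composite process reward and GRPO's group-relative advantage could be wired together. The component scorers, weights, and the Sample fields are illustrative placeholders (simple heuristics), not the paper's actual reward implementation; in practice each process reward would come from a trained or rule-based scorer.

# Illustrative sketch only: composite process reward for GRPO training.
# The scorers below are toy heuristics and the weights are placeholders,
# not the paper's actual reward implementation.
import re
from dataclasses import dataclass

@dataclass
class Sample:
    answer: str              # model's final answer
    gold_answer: str         # reference answer
    reasoning: str           # chain-of-thought text
    task_complexity: float   # assumed difficulty estimate in [0, 1]

def consistency_score(reasoning: str) -> float:
    # Toy proxy: penalize explicit self-contradiction markers in the chain.
    contradictions = len(re.findall(r"\b(but actually|wait, no|contradicts?)\b", reasoning.lower()))
    return max(0.0, 1.0 - 0.5 * contradictions)

def structure_score(reasoning: str) -> float:
    # Toy proxy: reward analytical / causal connectives.
    markers = len(re.findall(r"\b(because|therefore|hence|implies)\b", reasoning.lower()))
    return min(1.0, markers / 3)

def depth_score(reasoning: str, complexity: float) -> float:
    # Toy proxy: reasoning length should roughly track task complexity,
    # penalizing both over- and under-thinking.
    target = 50 + 450 * complexity           # placeholder target token count
    actual = len(reasoning.split())
    return max(0.0, 1.0 - abs(actual - target) / target)

def composite_reward(s: Sample, w=(1.0, 0.5, 0.3, 0.2)) -> float:
    # Weighted sum of the outcome reward and the three process rewards.
    r_correct = float(s.answer.strip().lower() == s.gold_answer.strip().lower())
    return (w[0] * r_correct
            + w[1] * consistency_score(s.reasoning)
            + w[2] * structure_score(s.reasoning)
            + w[3] * depth_score(s.reasoning, s.task_complexity))

def group_relative_advantages(rewards: list[float]) -> list[float]:
    # Standard GRPO normalization: each sampled response is scored relative
    # to the mean and standard deviation of its sampling group.
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]

In training, each prompt would be sampled several times, composite_reward scored per response, and group_relative_advantages used as the advantage signal in the GRPO policy-gradient update.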

📊 Results

SOTA on the MMAU Test-mini benchmark
Outperforms Gemini 2.5 Pro
Outperforms GPT-4o Audio
✓ Reasoning scales positively at test time
Model                        | MMAU Test-mini | Reasoning Scaling
GPT-4o Audio                 | ~65%           | ❌ Inverse scaling
Gemini 2.5 Pro               | ~68%           | ❌ Inverse scaling
Baseline (outcome-only RL)   | ~63%           | ❌ Inverse scaling
CESAR (Ours)                 | SOTA (best)    | ✅ Positive scaling

📅 Publication Journey

Oct 2025
arXiv preprint released (arXiv:2510.20867)
Oct 2025
Submitted to ICLR 2026, The Fourteenth International Conference on Learning Representations (Submission #8335)
Jan 2026
✅ Accepted at ICLR 2026 (Poster) · OpenReview
Apr 2026
Presented at ICLR 2026 · Rio de Janeiro, Brazil

📖 BibTeX

@inproceedings{fan2026incentivizing,
  title={Incentivizing Consistent, Effective and Scalable Reasoning
         Capability in Audio {LLM}s via Reasoning Process Rewards},
  author={Jiajun Fan and Roger Ren and Jingyuan Li and Rahul Pandey and
          Prashanth Gurunath Shivakumar and Yile Gu and Ankur Gandhe
          and Ge Liu and Ivan Bulyko},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=DUr48hxO2h}
}