🏆 ICLR 2026

Incentivizing Consistent, Effective and Scalable Reasoning Capability in Audio LLMs via Reasoning Process Rewards

CESAR: Consistent, Effective, and Scalable Audio Reasoners

1University of Illinois Urbana-Champaign    2Amazon
TL;DR — Audio LLMs trained only on outcome rewards produce hallucinatory, inconsistent, and unscalable reasoning chains. CESAR shifts to rewarding the reasoning process itself via online RL (GRPO), resolving test-time inverse scaling and achieving SOTA on MMAU — outperforming Gemini 2.5 Pro and GPT-4o Audio.
CESAR Framework Overview
Figure 1. General framework comparison of different training paradigms for Audio LLMs. CESAR's process rewards incentivize consistency, structured analytical reasoning, and calibrated depth — resolving test-time inverse scaling and enabling reasoning to genuinely help performance.

🔍 The Problem

Adding chain-of-thought reasoning to Audio LLMs often degrades performance — a phenomenon we term test-time inverse scaling. Longer reasoning chains yield progressively worse results. Why?

❌ Without CESAR

Models without guidance for the reasoning process produce hallucinatory, inconsistent reasoning that accumulates errors over longer chains. Outcome-only rewards cannot fix this.

✅ With CESAR

Process rewards incentivize consistency, structured analytical patterns, causal reasoning, and calibrated depth — transforming reasoning from a liability into a genuine capability.

⚙️ Method

CESAR uses Group Relative Policy Optimization (GRPO) with a multi-faceted reward suite that goes beyond simple correctness:

1. **Correctness Reward:** standard outcome reward; is the final answer right?
2. **Consistency Reward:** does the reasoning chain stay internally consistent?
3. **Structure Reward:** does the reasoning follow structured analytical patterns and causal logic?
4. **Depth Calibration:** is the reasoning depth calibrated to task complexity? This reward penalizes both over- and under-thinking.
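The reward suite above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the weights in `RewardWeights` are hypothetical, each component reward is assumed to be a scalar in [0, 1], and `grpo_advantages` shows the standard group-relative normalization that GRPO applies to a group of sampled responses.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RewardWeights:
    # Hypothetical weights for illustration; the paper's exact mixture
    # is not specified on this page.
    correctness: float = 1.0
    consistency: float = 0.3
    structure: float = 0.3
    depth: float = 0.2

def combined_reward(correct: float, consistent: float,
                    structured: float, depth_calibrated: float,
                    w: RewardWeights = RewardWeights()) -> float:
    """Weighted sum of the outcome reward and the three process rewards,
    each assumed to be a score in [0, 1]."""
    return (w.correctness * correct
            + w.consistency * consistent
            + w.structure * structured
            + w.depth * depth_calibrated)

def grpo_advantages(group_rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: each sampled response's reward is normalized
    by the mean and standard deviation of its sampling group."""
    n = len(group_rewards)
    mean = sum(group_rewards) / n
    var = sum((r - mean) ** 2 for r in group_rewards) / n
    return [(r - mean) / (var ** 0.5 + eps) for r in group_rewards]
```

Because advantages are computed relative to the group, a response only gains credit by having a better combined reward than its siblings, which is what lets process scores shape the policy without a separate value network.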

📊 Results

- **SOTA** on the MMAU Test-mini benchmark
- Outperforms **Gemini 2.5 Pro**
- Outperforms **GPT-4o Audio**
- Reasoning **scales positively** at test time
| Model | MMAU Test-mini | Reasoning Scaling |
|---|---|---|
| GPT-4o Audio | ~65% | ❌ Inverse scaling |
| Gemini 2.5 Pro | ~68% | ❌ Inverse scaling |
| Baseline (outcome-only RL) | ~63% | ❌ Inverse scaling |
| CESAR (Ours) | **SOTA** | ✅ Positive scaling |

🏗️ Framework Architecture

CESAR framework with scaling
Figure 2. Complete CESAR training pipeline with scaling analysis. Process rewards decompose reasoning quality into consistency, structure, and depth calibration — enabling stable, scalable reasoning improvement.

📈 Test-Time Scaling Analysis

Test-time scaling curves
Figure 3. Test-time scaling curves. Baselines exhibit inverse scaling; CESAR maintains positive scaling as reasoning tokens increase.
Win rate at different scales
Figure 4. Win rate analysis at different reasoning token budgets — CESAR's advantage grows with more compute.

🎯 Multi-Dimensional Evaluation

Performance radar chart
Figure 5. Multi-dimensional performance radar showing CESAR's balanced improvement across all reasoning quality dimensions.
Token-level radar chart
Figure 6. Token-level radar comparison showing how CESAR improves reasoning quality even at the individual token level.

📉 Training Dynamics & Win Rate

Training curves
Figure 7. Training reward curves showing stable convergence of CESAR's multi-faceted process rewards.
AI Judge win rate analysis
Figure 8. AI judge win rates — CESAR consistently wins across multiple evaluation dimensions and prompts.

📐 Scaling Slope Ablation

The scaling slope measures whether additional reasoning tokens help or hurt: a positive slope means more thinking yields better results.
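One way to make this concrete is an ordinary least-squares slope of accuracy against the reasoning-token budget. The function below is a generic sketch of that idea, not the paper's evaluation code, and the sample numbers in the test are illustrative rather than reported results.

```python
from typing import Sequence

def scaling_slope(token_budgets: Sequence[float],
                  accuracies: Sequence[float]) -> float:
    """Ordinary least-squares slope of accuracy vs. reasoning-token budget.
    Positive: more reasoning helps; negative: test-time inverse scaling."""
    n = len(token_budgets)
    mean_x = sum(token_budgets) / n
    mean_y = sum(accuracies) / n
    num = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(token_budgets, accuracies))
    den = sum((x - mean_x) ** 2 for x in token_budgets)
    return num / den
```

A model whose accuracy drops as the token budget grows gets a negative slope (inverse scaling); one whose accuracy rises gets a positive slope.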

CESAR scaling slope
Figure 9. CESAR achieves consistent positive scaling slope — the only method where reasoning genuinely helps.
Slope ablation
Figure 10. Ablation: removing the overthinking penalty degrades slope — every component in CESAR's reward suite matters.
Qwen baseline slope
Figure 11. Qwen2.5-Omni baseline shows flat/negative scaling — without process rewards, reasoning doesn't scale.

📅 Publication Journey

- **Oct 2025** · arXiv preprint released (arXiv:2510.20867)
- **Oct 2025** · Submitted to ICLR 2026, The Fourteenth International Conference on Learning Representations (Submission #8335)
- **Jan 2026** · ✅ Accepted at ICLR 2026 (Poster); decision on OpenReview
- **Apr 2026** · Presented at ICLR 2026 · Rio de Janeiro, Brazil

📖 BibTeX

@inproceedings{fan2026incentivizing,
  title={Incentivizing Consistent, Effective and Scalable Reasoning
         Capability in Audio {LLM}s via Reasoning Process Rewards},
  author={Jiajun Fan and Roger Ren and Jingyuan Li and Rahul Pandey and
          Prashanth Gurunath Shivakumar and Ivan Bulyko
          and Ankur Gandhe and Ge Liu and Yile Gu},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=DUr48hxO2h}
}