Incentivizing Consistent, Effective and Scalable Reasoning Capability in Audio LLMs via Reasoning Process Rewards

Published in International Conference on Learning Representations (ICLR 2026), 2026

Recommended citation: Jiajun Fan, Roger Ren, Jingyuan Li, Rahul Pandey, Prashanth G. Shivakumar, Yile Gu, Ankur Gandhe, Ge Liu, Ivan Bulyko. "Incentivizing Consistent, Effective and Scalable Reasoning Capability in Audio LLMs via Reasoning Process Rewards." ICLR 2026. https://openreview.net/forum?id=DUr48hxO2h

We propose CESAR, an online RL framework based on GRPO whose multi-faceted reasoning process rewards incentivize consistency, structured analytical patterns, and calibrated reasoning depth. CESAR resolves test-time inverse scaling in Audio LLMs and achieves state-of-the-art results on MMAU Test-mini, substantially outperforming Gemini 2.5 Pro and GPT-4o Audio.
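As a rough illustration of the general idea (not the paper's implementation), the sketch below shows how several reward components could be folded into one scalar and then turned into GRPO-style group-relative advantages, where each sampled response is scored against the mean and standard deviation of its own group. The component names, the weights in `WEIGHTS`, the helper names, and the toy scores are all hypothetical; the actual reward definitions and weighting used by CESAR are given in the paper. Using the population standard deviation is likewise an assumption of this sketch.

```python
from statistics import mean, pstdev

# Hypothetical component weights, chosen only for illustration.
WEIGHTS = {"accuracy": 1.0, "consistency": 0.5, "structure": 0.25, "depth": 0.25}


def combined_reward(components, weights=WEIGHTS):
    """Weighted sum of per-response reward components (missing ones count as 0)."""
    return sum(w * components.get(name, 0.0) for name, w in weights.items())


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group: four sampled responses to one prompt, with made-up component scores.
group = [
    {"accuracy": 1.0, "consistency": 1.0, "structure": 1.0, "depth": 0.5},
    {"accuracy": 1.0, "consistency": 0.0, "structure": 0.0, "depth": 0.0},
    {"accuracy": 0.0, "consistency": 1.0, "structure": 1.0, "depth": 1.0},
    {"accuracy": 0.0, "consistency": 0.0, "structure": 0.0, "depth": 0.0},
]

rewards = [combined_reward(c) for c in group]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.3f}  advantage={a:+.3f}")
```

In this toy group, the first two responses are both correct, but the one whose reasoning also scores well on the process components receives the larger advantage, which is the kind of signal a process reward is meant to add on top of answer correctness.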