Incentivizing Consistent, Effective and Scalable Reasoning Capability in Audio LLMs via Reasoning Process Rewards
Published in International Conference on Learning Representations 2026 (ICLR 2026), 2026
Recommended citation: Jiajun Fan, Roger Ren, Jingyuan Li, Rahul Pandey, Prashanth G. Shivakumar, Yile Gu, Ankur Gandhe, Ge Liu, Ivan Bulyko. "Incentivizing Consistent, Effective and Scalable Reasoning Capability in Audio LLMs via Reasoning Process Rewards." ICLR 2026. https://openreview.net/forum?id=DUr48hxO2h
We propose CESAR, an online RL framework (GRPO) with multi-faceted reasoning process rewards incentivizing consistency, structured analytical patterns, and calibrated depth. Resolves test-time inverse scaling in Audio LLMs; achieves SOTA on MMAU Test-mini, substantially outperforming Gemini 2.5 Pro and GPT-4o Audio.
