CESAR: Consistent, Effective, and Scalable Audio Reasoners
Adding chain-of-thought reasoning to Audio LLMs often degrades performance, a phenomenon we term *test-time inverse scaling*: longer reasoning chains yield progressively worse results. Why?
Models without guidance for the reasoning process produce hallucinatory, inconsistent reasoning that accumulates errors over longer chains. Outcome-only rewards cannot fix this.
Process rewards incentivize consistency, structured analytical patterns, causal reasoning, and calibrated depth — transforming reasoning from a liability into a genuine capability.
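To make the phenomenon concrete, here is a minimal diagnostic sketch (not from the paper) that buckets a model's outputs by reasoning-chain length and reports per-bucket accuracy; under test-time inverse scaling, accuracy falls as the buckets get longer. Field names such as `reasoning_tokens` and the bucket size are illustrative assumptions.

```python
# Hypothetical diagnostic for test-time inverse scaling: bucket outputs by
# reasoning-chain length and check whether accuracy drops for longer chains.
from collections import defaultdict
from typing import Dict, List, Tuple


def accuracy_by_reasoning_length(
    samples: List[Dict],      # each sample: {"reasoning_tokens": int, "correct": bool}
    bucket_size: int = 256,   # bucket width in reasoning tokens (arbitrary choice)
) -> List[Tuple[str, float]]:
    """Group samples into reasoning-length buckets and report per-bucket accuracy."""
    buckets: Dict[int, List[bool]] = defaultdict(list)
    for s in samples:
        buckets[s["reasoning_tokens"] // bucket_size].append(s["correct"])
    report = []
    for b in sorted(buckets):
        acc = sum(buckets[b]) / len(buckets[b])
        report.append((f"{b * bucket_size}-{(b + 1) * bucket_size} tokens", acc))
    # Inverse scaling shows up as accuracy decreasing down this list.
    return report
```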
CESAR uses Group Relative Policy Optimization (GRPO) with a multi-faceted reward suite that goes beyond simple correctness (a minimal sketch of how such rewards can combine follows the list):

- Standard outcome reward: is the final answer right?
- Consistency: does the reasoning chain stay internally consistent?
- Structure: does the reasoning follow structured analytical patterns and causal logic?
- Calibrated depth: is the reasoning depth matched to task complexity, avoiding both over- and under-thinking?
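As a rough illustration of how these components can be combined, the sketch below sums outcome and process rewards for each sampled rollout and converts them into GRPO-style group-relative advantages. The reward weights, field names, and scoring stubs are assumptions for illustration, not the paper's exact implementation.

```python
# Hypothetical sketch: combining outcome and process rewards under GRPO-style
# group-relative advantages. Weights and reward fields are placeholders.
from dataclasses import dataclass
from typing import List
import statistics


@dataclass
class Rollout:
    answer_correct: bool   # outcome: did the final answer match the reference?
    consistency: float     # 0..1, internal consistency of the reasoning chain
    structure: float       # 0..1, structured / causal analytical pattern score
    depth_penalty: float   # 0..1, penalty for over- or under-thinking


def total_reward(r: Rollout,
                 w_outcome: float = 1.0,
                 w_consistency: float = 0.5,
                 w_structure: float = 0.5,
                 w_calibration: float = 0.25) -> float:
    """Weighted sum of outcome and process rewards (weights are placeholders)."""
    return (w_outcome * float(r.answer_correct)
            + w_consistency * r.consistency
            + w_structure * r.structure
            - w_calibration * r.depth_penalty)


def grpo_advantages(group: List[Rollout]) -> List[float]:
    """Group-relative advantages: normalize each rollout's total reward by the
    mean and standard deviation of its sampling group (the core of GRPO)."""
    rewards = [total_reward(r) for r in group]
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1e-6  # avoid division by zero
    return [(x - mean) / std for x in rewards]


if __name__ == "__main__":
    # Example group of sampled reasoning rollouts for the same audio question.
    group = [
        Rollout(answer_correct=True,  consistency=0.9, structure=0.8, depth_penalty=0.1),
        Rollout(answer_correct=True,  consistency=0.4, structure=0.3, depth_penalty=0.6),
        Rollout(answer_correct=False, consistency=0.7, structure=0.6, depth_penalty=0.2),
    ]
    print(grpo_advantages(group))
```

In practice the process scores (consistency, structure, depth) would come from judges over the reasoning trace; the point of the sketch is only that GRPO normalizes the combined reward within each group of rollouts, so process quality can shift advantages even when outcomes tie.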
| Model | MMAU Test-mini Accuracy | Reasoning Scaling |
|---|---|---|
| GPT-4o Audio | ~65% | ❌ Inverse scaling |
| Gemini 2.5 Pro | ~68% | ❌ Inverse scaling |
| Baseline (outcome-only RL) | ~63% | ❌ Inverse scaling |
| CESAR (Ours) | SOTA | ✅ Positive scaling |
@inproceedings{fan2026incentivizing,
  title={Incentivizing Consistent, Effective and Scalable Reasoning Capability in Audio {LLM}s via Reasoning Process Rewards},
  author={Jiajun Fan and Roger Ren and Jingyuan Li and Rahul Pandey and Prashanth Gurunath Shivakumar and Yile Gu and Ankur Gandhe and Ge Liu and Ivan Bulyko},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=DUr48hxO2h}
}