Test-Time Inverse Scaling in Audio LLMs
Chain-of-thought reasoning helps text LLMs but hurts Audio LLMs. This post explains why — and how process rewards fix it.
A deep dive into why fixed regularization in RLHF leads to diversity collapse, and how adaptive sample-level control resolves the exploration-exploitation dilemma.