Project pages for selected publications – with method overview, results, and BibTeX
Resolves test-time inverse scaling in Audio LLMs by rewarding the reasoning process.
Sample-level adaptive KL – high-value samples explore freely, poor samples stay constrained (see the sketch after this list).
No human data, no mode collapse. W2 regularization preserves generation diversity.
Intermediate feedback + dual-stability for robust flow matching fine-tuning on SD3.
Learnable behavior control via hybrid policy mapping + bandit meta-controller.
Unified RL framework showing data distribution is the key to superhuman efficiency.
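As a rough illustration of the sample-level adaptive KL entry above, here is a minimal sketch; the per-batch value normalization and the exponential coefficient schedule are illustrative assumptions, not the method's exact formulation.

```python
# Minimal sketch of a sample-level adaptive KL penalty (illustrative only).
# Assumption: per-sample value estimates are mapped to per-sample KL coefficients,
# so high-value samples get a small penalty (explore freely) and low-value
# samples get a large penalty (stay close to the reference policy).
import numpy as np

def adaptive_kl_coeffs(sample_values, base_coeff=0.1, temperature=1.0):
    """Map per-sample value estimates to per-sample KL coefficients."""
    v = np.asarray(sample_values, dtype=np.float64)
    # Normalize values within the batch, then invert: high value -> low penalty.
    z = (v - v.mean()) / (v.std() + 1e-8)
    return base_coeff * np.exp(-z / temperature)

def penalized_rewards(rewards, kl_divs, sample_values):
    """Subtract the per-sample adaptive KL penalty from the raw rewards."""
    coeffs = adaptive_kl_coeffs(sample_values)
    return np.asarray(rewards, dtype=np.float64) - coeffs * np.asarray(kl_divs, dtype=np.float64)

# Example: three samples with identical KL but different value estimates
# receive different effective penalties.
print(penalized_rewards(rewards=[1.0, 0.2, -0.5],
                        kl_divs=[0.8, 0.8, 0.8],
                        sample_values=[2.0, 0.0, -2.0]))
```

The only point of the sketch is the shape of the mechanism: a per-sample coefficient that loosens the KL constraint where the sample is valuable and tightens it where it is not.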