<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://www.jiajunfan.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://www.jiajunfan.com/" rel="alternate" type="text/html" /><updated>2026-06-08T04:59:03-07:00</updated><id>https://www.jiajunfan.com/feed.xml</id><title type="html">Jiajun Fan</title><subtitle>CS Ph.D. Student at UIUC · RL Post-Training · Agentic RL · ICLR / ICML / NeurIPS</subtitle><author><name>Jiajun Fan</name><email>jiajunf3@illinois.edu</email></author><entry><title type="html">Test-Time Inverse Scaling in Audio LLMs</title><link href="https://www.jiajunfan.com/posts/2025/10/inverse-scaling-audio-llms/" rel="alternate" type="text/html" title="Test-Time Inverse Scaling in Audio LLMs" /><published>2025-10-27T00:00:00-07:00</published><updated>2025-10-27T00:00:00-07:00</updated><id>https://www.jiajunfan.com/posts/2025/10/inverse-scaling-audio-llms</id><content type="html" xml:base="https://www.jiajunfan.com/posts/2025/10/inverse-scaling-audio-llms/"><![CDATA[<blockquote>
  <p>Chain-of-thought prompting has transformed text-based LLMs. Models like OpenAI o1 and DeepSeek-R1 show that explicit reasoning dramatically improves performance. But something counterintuitive happens in Audio LLMs: <strong>more reasoning makes things worse.</strong> This post explains why, what’s actually going wrong inside the reasoning chains, and how process-level rewards resolve the issue.</p>
</blockquote>

<h2 class="no_toc" id="table-of-contents">Table of Contents</h2>

<ul id="markdown-toc">
  <li><a href="#the-scaling-paradox" id="markdown-toc-the-scaling-paradox">The Scaling Paradox</a></li>
  <li><a href="#why-reasoning-goes-wrong" id="markdown-toc-why-reasoning-goes-wrong">Why Reasoning Goes Wrong</a>    <ul>
      <li><a href="#the-training-gap" id="markdown-toc-the-training-gap">The training gap</a></li>
      <li><a href="#three-failure-modes" id="markdown-toc-three-failure-modes">Three failure modes</a></li>
      <li><a href="#why-doesnt-reasoning-emerge-naturally" id="markdown-toc-why-doesnt-reasoning-emerge-naturally">Why doesn’t reasoning emerge naturally?</a></li>
    </ul>
  </li>
  <li><a href="#process-rewards-supervising-how-the-model-thinks" id="markdown-toc-process-rewards-supervising-how-the-model-thinks">Process Rewards: Supervising How the Model Thinks</a>    <ul>
      <li><a href="#reasoning-consistency-reward" id="markdown-toc-reasoning-consistency-reward">Reasoning consistency reward</a></li>
      <li><a href="#structured-reasoning-keywords" id="markdown-toc-structured-reasoning-keywords">Structured reasoning keywords</a></li>
      <li><a href="#overthinking-penalty" id="markdown-toc-overthinking-penalty">Overthinking penalty</a></li>
    </ul>
  </li>
  <li><a href="#the-reasoning-sweet-spot" id="markdown-toc-the-reasoning-sweet-spot">The Reasoning Sweet Spot</a></li>
  <li><a href="#beyond-reasoning-synergistic-effects" id="markdown-toc-beyond-reasoning-synergistic-effects">Beyond Reasoning: Synergistic Effects</a></li>
  <li><a href="#broader-lessons" id="markdown-toc-broader-lessons">Broader Lessons</a>    <ul>
      <li><a href="#1-process-supervision-matters" id="markdown-toc-1-process-supervision-matters">1. Process supervision matters</a></li>
      <li><a href="#2-more-compute-is-not-automatically-better" id="markdown-toc-2-more-compute-is-not-automatically-better">2. More compute is not automatically better</a></li>
      <li><a href="#3-simple-rewards-can-be-powerful" id="markdown-toc-3-simple-rewards-can-be-powerful">3. Simple rewards can be powerful</a></li>
    </ul>
  </li>
  <li><a href="#summary" id="markdown-toc-summary">Summary</a></li>
  <li><a href="#references" id="markdown-toc-references">References</a></li>
</ul>

<hr />

<h2 id="the-scaling-paradox">The Scaling Paradox</h2>

<p>In text LLMs, there’s a reliable pattern: allocating more compute to reasoning at inference time (longer chain-of-thought, more tokens to “think”) tends to improve performance. This is the foundation of test-time scaling — the principle behind reasoning models like o1, R1, and QwQ.</p>

<p>Audio LLMs tell a different story. When you allow an Audio LLM to reason before answering, performance often <em>degrades</em>. When you let it reason <em>longer</em>, it gets <em>even worse</em>:</p>

\[\frac{\partial P(L_{\text{max}})}{\partial L_{\text{max}}} &lt; 0\]

<p>where \(P(L_{\text{max}})\) is accuracy when the model is allowed up to \(L_{\text{max}}\) reasoning tokens.</p>

<p>We call this <strong>test-time inverse scaling</strong>: the more the model “thinks,” the worse it performs.</p>

<figure style="margin:1.5em 0">
<img src="/images/blog/cesar_scaling.png" style="width:100%;border-radius:10px;border:1px solid #e0e8f0" />
<figcaption style="font-size:0.82em;color:#666;margin-top:8px;text-align:center"><b>Fig 1.</b> Test-time scaling curves. Baseline models (outcome-only RL) show inverse scaling — performance degrades as maximum reasoning length increases. CESAR maintains positive scaling.</figcaption>
</figure>

<p>This isn’t subtle. Models that achieve 70%+ accuracy with direct answers can drop below 60% when forced to reason. The reasoning itself is actively harmful.</p>

<p>The natural conclusion might be: “reasoning just doesn’t work for audio.” But that’s wrong. The problem isn’t reasoning itself — it’s how models are trained to reason.</p>

<h2 id="why-reasoning-goes-wrong">Why Reasoning Goes Wrong</h2>

<h3 id="the-training-gap">The training gap</h3>

<p>Current RL approaches for Audio LLMs use <strong>outcome-only rewards</strong>:</p>

\[R_{\text{RLVR}}(s_i) = \mathbb{I}[\hat{y}_i = y_i] + \mathbb{I}[\text{ValidFormat}(s_i)]\]

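<p>Here \(s_i\) is a sampled response made up of a reasoning chain \(t_i\) and a final answer \(\hat{y}_i\), and \(y_i\) is the ground-truth answer. The reward checks two things: “Is the final answer correct?” and “Is the output properly formatted?” That’s all. The entire reasoning chain \(t_i\), which could be hundreds of tokens, is completely invisible to the reward function.</p>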

<p>This means the model gets the same reward whether its reasoning is:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"The audio contains a clear piano melody in the major key, with a tempo around 
120 BPM. The harmonic structure suggests classical rather than jazz..."  → Answer: B
</code></pre></div></div>

<p>or:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"Let me think about this. I think the answer might be something. Actually let me 
reconsider. The sound is interesting. Maybe it could be several things..."  → Answer: B
</code></pre></div></div>

<p>As long as the final answer is correct, both receive identical reward. <strong>The model has zero gradient signal to improve reasoning quality.</strong></p>
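<p>A minimal sketch of the outcome-only reward makes the blind spot concrete. The <code>Answer: X</code> output format and the function name are assumptions for illustration, not the format used in the paper.</p>

```python
import re

def outcome_only_reward(response: str, gold: str) -> float:
    """R_RLVR = 1[answer correct] + 1[valid format]; the reasoning text is never scored."""
    # Hypothetical format: free-form reasoning that ends with "Answer: X".
    match = re.search(r"Answer:\s*([A-D])\s*$", response)
    valid_format = match is not None
    correct = valid_format and match.group(1) == gold
    return float(correct) + float(valid_format)

grounded = ("The audio contains a clear piano melody in the major key, "
            "with a tempo around 120 BPM. Answer: B")
rambling = ("Let me think about this. I think the answer might be something. "
            "Actually let me reconsider. Answer: B")

reward_grounded = outcome_only_reward(grounded, "B")
reward_rambling = outcome_only_reward(rambling, "B")
print(reward_grounded, reward_rambling)
```

<p>Both chains score the same because nothing before the final answer is ever inspected.</p>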

<h3 id="three-failure-modes">Three failure modes</h3>

<p>Without supervision of the reasoning process, three specific pathologies emerge:</p>

<p><strong>1. Hallucinatory reasoning.</strong> The model generates plausible-sounding but factually wrong analysis — describing instruments that aren’t present, inventing acoustic properties, confabulating music theory. Each additional reasoning token is another opportunity to introduce a hallucination, and hallucinations compound across the chain.</p>

<p><strong>2. Reasoning-answer inconsistency.</strong> The reasoning says one thing, but the answer says another. Example:</p>

<blockquote>
  <p><em>“The audio clearly contains a piano solo… therefore the answer is <strong>drums</strong>.”</em></p>
</blockquote>

<p>With outcome-only rewards, the model can learn to produce correct answers <em>despite</em> wrong reasoning. The reasoning becomes decorative noise disconnected from the answer. Longer chains amplify this noise.</p>

<p><strong>3. Unstructured, circular reasoning.</strong> Without guidance on <em>how</em> to reason, models develop rambling patterns: repeating observations, circular logic, tangential elaboration. No systematic analysis structure (elimination, comparison, deduction) emerges.</p>

<h3 id="why-doesnt-reasoning-emerge-naturally">Why doesn’t reasoning emerge naturally?</h3>

<p>In text LLMs, reasoning <em>does</em> emerge somewhat naturally from outcome-only training (though process rewards still help). Why is audio different?</p>

<p>The key difference is the <strong>modality gap</strong>. Text reasoning can leverage massive amounts of pre-training data that already contains reasoning patterns (textbooks, proofs, tutorials). The model has seen millions of examples of structured argumentation.</p>

<p>Audio reasoning has no such foundation. Pre-training on audio-text pairs teaches the model to <em>describe</em> audio, not to <em>reason about</em> it. When you then ask it to reason via RL, it has no templates to draw from — so it generates text that <em>looks</em> like reasoning but isn’t grounded in actual audio analysis.</p>

<h2 id="process-rewards-supervising-how-the-model-thinks">Process Rewards: Supervising How the Model Thinks</h2>

<p>The fix is to reward not just the final answer, but the reasoning process itself. CESAR introduces a <strong>multi-faceted reward suite</strong>:</p>

\[R_{\text{total}}(s_i) = \underbrace{\alpha_1 R_{\text{acc}} + \alpha_2 R_{\text{format}}}_{\text{outcome (existing)}} + \underbrace{\alpha_3 R_{\text{consistency}} + \alpha_4 R_{\text{keywords}} + \alpha_5 R_{\text{overthinking}}}_{\text{process (new)}}\]

<p>Each process reward targets a specific failure mode:</p>

<figure style="margin:1.5em 0">
<img src="/images/blog/cesar_framework.png" style="width:100%;border-radius:10px;border:1px solid #e0e8f0" />
<figcaption style="font-size:0.82em;color:#666;margin-top:8px;text-align:center"><b>Fig 2.</b> CESAR framework. Left: existing outcome-only methods reward only final answer correctness. Right: CESAR adds process rewards for reasoning consistency, structure, and depth.</figcaption>
</figure>

<h3 id="reasoning-consistency-reward">Reasoning consistency reward</h3>

\[R_{\text{consistency}}(s_i) = \text{Sim}(t_i, \hat{y}_i) + \text{Sim}(t_i, Q_i)\]

<p>This measures semantic alignment in two directions:</p>

<ul>
  <li>
    <p><strong>Reasoning ↔ Answer</strong> (\(\text{Sim}(t_i, \hat{y}_i)\)): Does the reasoning logically lead to the chosen answer? This directly combats reasoning-answer inconsistency.</p>
  </li>
  <li>
    <p><strong>Reasoning ↔ Question</strong> (\(\text{Sim}(t_i, Q_i)\)): Is the reasoning actually about the question asked? This discourages off-topic hallucination: reasoning that rambles about unrelated topics shares few concepts with the question and scores low.</p>
  </li>
</ul>

<p>Similarity is computed via concept overlap:</p>

\[\text{Sim}(x, y) = \frac{|\text{Concepts}(x) \cap \text{Concepts}(y)|}{\max(|\text{Concepts}(x)|, |\text{Concepts}(y)|)}\]

<p>This is deliberately simple: no learned model, no LLM judge, just lexical overlap of key concepts. The simplicity is a feature. The reward is stable and deterministic, without the complexity and variance that a learned reward model would introduce.</p>
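<p>The overlap formula can be sketched in a few lines. The post does not say how \(\text{Concepts}(\cdot)\) is extracted, so the extractor below (lowercased word tokens minus a tiny stopword list) is purely an assumption, as are the function names.</p>

```python
import re

# Hypothetical stopword list; the real concept extractor is not specified in the post.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "in", "to", "and",
             "this", "that", "it", "what", "which"}

def concepts(text: str) -> set[str]:
    """Assumed extractor: lowercase word tokens with stopwords removed."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def sim(x: str, y: str) -> float:
    """Sim(x, y) = |C(x) intersect C(y)| / max(|C(x)|, |C(y)|)."""
    cx, cy = concepts(x), concepts(y)
    denom = max(len(cx), len(cy))
    return len(cx.intersection(cy)) / denom if denom else 0.0

def consistency_reward(reasoning: str, answer: str, question: str) -> float:
    return sim(reasoning, answer) + sim(reasoning, question)

question = "Which instrument plays the melody?"
reasoning = "The melody has a piano timbre, so the instrument is a piano."

consistent = consistency_reward(reasoning, "piano", question)
inconsistent = consistency_reward(reasoning, "drums", question)
print(consistent, inconsistent)
```

<p>With this extractor, the chain that concludes “piano” scores higher than the same chain paired with “drums”, because the second answer shares no concept with the reasoning.</p>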

<h3 id="structured-reasoning-keywords">Structured reasoning keywords</h3>

\[R_{\text{keywords}} = R_{\text{pattern}} + R_{\text{logic}} + R_{\text{domain}}\]

<p>Three components that serve as cognitive scaffolding:</p>

<p><strong>Analytical patterns</strong> (\(R_{\text{pattern}}\)): Detects and rewards structured reasoning strategies — elimination (“Option A is unlikely because…”), comparison (“Comparing the tempo to…”), step-by-step analysis (“First, I’ll analyze the frequency spectrum…”). These patterns don’t emerge reliably from outcome-only training.</p>

<p><strong>Logical connectives</strong> (\(R_{\text{logic}}\)): Rewards markers of genuine logical reasoning — causal chains (“therefore,” “because”), hypotheticals (“if the tempo were…”), evidence-based conclusions (“based on the harmonic content…”). Promotes coherent logical progression rather than stream-of-consciousness.</p>

<p><strong>Domain vocabulary</strong> (\(R_{\text{domain}}\)): Rewards audio-specific terminology — acoustic properties (timbre, frequency, amplitude), musical concepts (tempo, key, chord progression), speech features (prosody, formants). This grounds reasoning in the actual audio signal rather than generic text patterns.</p>
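<p>As a rough sketch, each component can be scored by checking which cue phrases appear in the reasoning. The cue lists and the “fraction of cues present” rule are assumptions built from the examples above; the paper’s actual lists and scoring may differ.</p>

```python
import re

# Hypothetical cue lists for illustration; the post gives examples, not full lists.
PATTERN_CUES = ("unlikely because", "comparing", "first", "step")
LOGIC_CUES = ("therefore", "because", "if", "based on")
DOMAIN_CUES = ("timbre", "frequency", "amplitude", "tempo",
               "key", "chord", "prosody", "formant")

def cue_score(text: str, cues: tuple[str, ...]) -> float:
    """Fraction of cues that occur as whole words or phrases (assumed scoring rule)."""
    lowered = text.lower()
    hits = sum(1 for cue in cues
               if re.search(r"\b" + re.escape(cue) + r"\b", lowered))
    return hits / len(cues)

def keyword_reward(reasoning: str) -> float:
    """R_keywords = R_pattern + R_logic + R_domain."""
    return (cue_score(reasoning, PATTERN_CUES)
            + cue_score(reasoning, LOGIC_CUES)
            + cue_score(reasoning, DOMAIN_CUES))

structured = ("First, comparing the tempo and timbre, option A is unlikely because "
              "the frequency range is too low; therefore the answer is B.")
vague = "The sound is interesting. Maybe it could be several things."

structured_score = keyword_reward(structured)
vague_score = keyword_reward(vague)
print(structured_score, vague_score)
```

<p>Note that a presence check like this says nothing about whether the cues are used correctly, so it can be inflated by piling on trigger words.</p>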

<h3 id="overthinking-penalty">Overthinking penalty</h3>

\[R_{\text{overthinking}}(s_i) = 1 - \frac{|t_i|}{L_{\text{max}}}\]

<p>A linear penalty on reasoning length, where \(|t_i|\) is the number of reasoning tokens and \(L_{\text{max}}\) is the maximum reasoning budget. This is the essential counterpart to keyword rewards: without it, the model could game the keyword reward by producing verbose, repetitive reasoning stuffed with trigger words.</p>

<p>The penalty forces the model to be <em>concise</em>. Combined with the keyword reward, the optimization pressure becomes: “reason in a structured, domain-grounded, logically coherent way — and do it efficiently.”</p>
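<p>A small worked example shows how the length term offsets keyword stuffing. The keyword scores, token counts, and unit weights below are made-up values for illustration; <code>process_score</code> is a hypothetical helper covering only the keyword and length terms of \(R_{\text{total}}\).</p>

```python
def overthinking_reward(num_reasoning_tokens: int, max_tokens: int) -> float:
    """R_overthinking = 1 - |t| / L_max; assumes the chain never exceeds L_max tokens."""
    return 1.0 - num_reasoning_tokens / max_tokens

def process_score(keyword_reward: float, num_tokens: int, max_tokens: int,
                  w_keywords: float = 1.0, w_length: float = 1.0) -> float:
    """Keyword and length terms of the total reward; the weights are hypothetical."""
    return (w_keywords * keyword_reward
            + w_length * overthinking_reward(num_tokens, max_tokens))

L_MAX = 512
# A concise chain with good keyword coverage vs. a long chain stuffed with cues.
concise = process_score(keyword_reward=1.6, num_tokens=80, max_tokens=L_MAX)
stuffed = process_score(keyword_reward=2.0, num_tokens=480, max_tokens=L_MAX)
print(concise, stuffed)
```

<p>Under these weights the concise chain wins even though the stuffed chain has the higher keyword score. If the keyword weight were much larger than the length weight, the ordering could flip, which is why the relative weights matter.</p>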

<h2 id="the-reasoning-sweet-spot">The Reasoning Sweet Spot</h2>

<p>With process rewards, test-time inverse scaling reverses. Performance now <em>increases</em> with reasoning length, up to a model-specific optimum:</p>

\[L_{\text{sweet}} = \arg\max_L P(L)\]

<p>CESAR discovers “reasoning sweet spots” where the model achieves peak performance. Below this point, the model lacks sufficient analytical depth. Beyond it, returns diminish (but don’t degrade — a critical difference from baselines).</p>

<figure style="margin:1.5em 0">
<img src="/images/blog/cesar_slope.png" style="width:100%;border-radius:10px;border:1px solid #e0e8f0" />
<figcaption style="font-size:0.82em;color:#666;margin-top:8px;text-align:center"><b>Fig 3.</b> Scaling slopes. CESAR is the only method with consistently positive slope — reasoning genuinely helps. Baselines show negative or flat slopes.</figcaption>
</figure>

<p>The <strong>scaling slope</strong> — whether performance increases or decreases with more reasoning — is perhaps the most important metric. A positive slope means the model has genuinely learned to reason. A negative slope means it’s just generating noise that happens to sometimes contain correct answers.</p>
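<p>One simple way to estimate both quantities from an evaluation sweep is a least-squares slope plus an argmax over the tested budgets. The accuracy numbers below are invented for illustration and are not results from the paper; the function names are likewise hypothetical.</p>

```python
def scaling_slope(lengths: list[float], accuracies: list[float]) -> float:
    """Ordinary least-squares slope of accuracy against max reasoning length."""
    n = len(lengths)
    mean_x = sum(lengths) / n
    mean_y = sum(accuracies) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(lengths, accuracies))
    var = sum((x - mean_x) ** 2 for x in lengths)
    return cov / var

def sweet_spot(lengths: list[float], accuracies: list[float]) -> float:
    """L_sweet = argmax_L P(L), restricted to the budgets that were evaluated."""
    best = max(range(len(lengths)), key=lambda i: accuracies[i])
    return lengths[best]

# Made-up numbers purely for illustration.
budgets = [64, 128, 256, 512]
outcome_only_acc = [0.70, 0.67, 0.63, 0.58]
process_reward_acc = [0.70, 0.73, 0.76, 0.76]

slope_outcome = scaling_slope(budgets, outcome_only_acc)
slope_process = scaling_slope(budgets, process_reward_acc)
best_budget = sweet_spot(budgets, process_reward_acc)
print(slope_outcome, slope_process, best_budget)
```

<p>When several budgets tie for the best accuracy, this sketch returns the smallest one, which is the cheaper choice at inference time.</p>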

<h2 id="beyond-reasoning-synergistic-effects">Beyond Reasoning: Synergistic Effects</h2>

<p>An unexpected finding: improving reasoning quality also improves <strong>perception</strong> capabilities. Models trained with process rewards become better at basic audio understanding tasks (identifying instruments, recognizing speakers, detecting events) — even though these tasks don’t require explicit reasoning.</p>

<figure style="margin:1.5em 0">
<img src="/images/blog/cesar_radar.png" style="width:100%;border-radius:10px;border:1px solid #e0e8f0" />
<figcaption style="font-size:0.82em;color:#666;margin-top:8px;text-align:center"><b>Fig 4.</b> Multi-dimensional evaluation radar. CESAR improves not just reasoning quality but also perception accuracy across audio understanding tasks.</figcaption>
</figure>

<p>This suggests that reasoning and perception are deeply entangled in multimodal models. Learning to reason <em>about</em> audio signals forces the model to develop better internal representations <em>of</em> those signals. The process rewards act as an implicit regularizer that shapes how the model attends to and processes audio features.</p>

<h2 id="broader-lessons">Broader Lessons</h2>

<h3 id="1-process-supervision-matters">1. Process supervision matters</h3>

<p>The parallel to education is direct. If you only grade students on final exam answers, some will learn the material and some will learn to guess well. If you also evaluate their work (proofs, derivations, reasoning steps), you select for genuine understanding. Outcome-only rewards are like grading only final answers.</p>

<h3 id="2-more-compute-is-not-automatically-better">2. More compute is not automatically better</h3>

<p>Test-time inverse scaling is a cautionary tale for the “just scale it” mentality. Without proper training, additional inference-time compute is wasted or harmful. The bottleneck is not compute — it’s the quality of what the model does with that compute.</p>

<h3 id="3-simple-rewards-can-be-powerful">3. Simple rewards can be powerful</h3>

<p>CESAR’s process rewards are deliberately simple — concept overlap, keyword detection, length penalty. No learned reward models, no LLM judges, no complex architectures. The power comes from providing <em>any</em> gradient signal on reasoning quality, where previously there was none. Going from zero supervision to basic supervision of the reasoning process is the critical step.</p>

<h2 id="summary">Summary</h2>

<table>
  <thead>
    <tr>
      <th style="text-align: left">Problem</th>
      <th style="text-align: left">Cause</th>
      <th style="text-align: left">Solution</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left">Hallucinatory reasoning</td>
      <td style="text-align: left">No feedback on reasoning content</td>
      <td style="text-align: left">Consistency reward (reasoning ↔ question)</td>
    </tr>
    <tr>
      <td style="text-align: left">Reasoning-answer disconnect</td>
      <td style="text-align: left">Reward ignores reasoning chain</td>
      <td style="text-align: left">Consistency reward (reasoning ↔ answer)</td>
    </tr>
    <tr>
      <td style="text-align: left">Unstructured thinking</td>
      <td style="text-align: left">No template for audio reasoning</td>
      <td style="text-align: left">Keyword rewards (pattern + logic + domain)</td>
    </tr>
    <tr>
      <td style="text-align: left">Verbose rambling</td>
      <td style="text-align: left">No length pressure</td>
      <td style="text-align: left">Overthinking penalty</td>
    </tr>
    <tr>
      <td style="text-align: left">Test-time inverse scaling</td>
      <td style="text-align: left">All of the above compounding</td>
      <td style="text-align: left">Multi-faceted process rewards</td>
    </tr>
  </tbody>
</table>

<p>The key insight: <strong>test-time inverse scaling is a training problem, not a fundamental limitation of reasoning in audio.</strong> Process rewards resolve it by providing the gradient signal that outcome-only rewards lack.</p>

<h2 id="references">References</h2>

<p>[1] Fan et al. “Incentivizing Consistent, Effective and Scalable Reasoning Capability in Audio LLMs via Reasoning Process Rewards.” ICLR 2026. <a href="https://openreview.net/forum?id=DUr48hxO2h">Paper</a></p>

<p>[2] Shao et al. “DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.” 2024. (GRPO)</p>

<p>[3] Wei et al. “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.” NeurIPS 2022.</p>

<p>[4] OpenAI. “Learning to Reason with LLMs.” 2024. (o1)</p>

<p>[5] DeepSeek-AI. “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.” 2025.</p>

<hr />

<p><em>Cited as:</em></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Fan, Jiajun. "Test-Time Inverse Scaling in Audio LLMs."
jiajunfan.com (2025).
</code></pre></div></div>]]></content><author><name>Jiajun Fan</name><email>jiajunf3@illinois.edu</email></author><category term="reinforcement learning" /><category term="audio LLMs" /><category term="reasoning" /><category term="test-time scaling" /><summary type="html"><![CDATA[Chain-of-thought reasoning helps text LLMs but hurts Audio LLMs. This post explains why — and how process rewards fix it.]]></summary></entry><entry><title type="html">The Exploration-Exploitation Dilemma in RLHF for Generative Models</title><link href="https://www.jiajunfan.com/posts/2025/10/diversity-collapse-rlhf/" rel="alternate" type="text/html" title="The Exploration-Exploitation Dilemma in RLHF for Generative Models" /><published>2025-10-20T00:00:00-07:00</published><updated>2025-10-20T00:00:00-07:00</updated><id>https://www.jiajunfan.com/posts/2025/10/diversity-collapse-rlhf</id><content type="html" xml:base="https://www.jiajunfan.com/posts/2025/10/diversity-collapse-rlhf/"><![CDATA[<blockquote>
  <p>RLHF fine-tuning of generative models faces a fundamental tension: you want to maximize reward (make outputs better), but doing so aggressively destroys the diversity of what the model can produce. This post explains why this happens, why it’s hard to fix with a single hyperparameter, and how advantage-based adaptive regularization provides a principled solution.</p>
</blockquote>

<h2 class="no_toc" id="table-of-contents">Table of Contents</h2>

<ul id="markdown-toc">
  <li><a href="#background-the-rlhf-objective" id="markdown-toc-background-the-rlhf-objective">Background: The RLHF Objective</a></li>
  <li><a href="#the-diversity-collapse-problem" id="markdown-toc-the-diversity-collapse-problem">The Diversity Collapse Problem</a>    <ul>
      <li><a href="#what-happens-without-regularization" id="markdown-toc-what-happens-without-regularization">What happens without regularization</a></li>
      <li><a href="#why-diversity-matters" id="markdown-toc-why-diversity-matters">Why diversity matters</a></li>
      <li><a href="#the-standard-fix-fixed-regularization" id="markdown-toc-the-standard-fix-fixed-regularization">The standard fix: fixed regularization</a></li>
    </ul>
  </li>
  <li><a href="#the-fixed-β-dilemma" id="markdown-toc-the-fixed-β-dilemma">The Fixed-β Dilemma</a></li>
  <li><a href="#adaptive-divergence-regularized-policy-optimization-adrpo" id="markdown-toc-adaptive-divergence-regularized-policy-optimization-adrpo">Adaptive Divergence Regularized Policy Optimization (ADRPO)</a>    <ul>
      <li><a href="#core-idea" id="markdown-toc-core-idea">Core idea</a></li>
      <li><a href="#for-flow-matching-models" id="markdown-toc-for-flow-matching-models">For flow matching models</a></li>
      <li><a href="#for-language-models" id="markdown-toc-for-language-models">For language models</a></li>
    </ul>
  </li>
  <li><a href="#what-emerges-in-practice" id="markdown-toc-what-emerges-in-practice">What Emerges in Practice</a>    <ul>
      <li><a href="#smaller-models-outperform-larger-ones" id="markdown-toc-smaller-models-outperform-larger-ones">Smaller models outperform larger ones</a></li>
      <li><a href="#emergent-exploration-in-llms" id="markdown-toc-emergent-exploration-in-llms">Emergent exploration in LLMs</a></li>
      <li><a href="#cross-domain-generality" id="markdown-toc-cross-domain-generality">Cross-domain generality</a></li>
    </ul>
  </li>
  <li><a href="#summary" id="markdown-toc-summary">Summary</a></li>
  <li><a href="#references" id="markdown-toc-references">References</a></li>
</ul>

<hr />

<h2 id="background-the-rlhf-objective">Background: The RLHF Objective</h2>

<p>Suppose you have a pre-trained generative model \(\pi_{\text{ref}}\) — a text-to-image model like Stable Diffusion 3, or a language model like Qwen. You want to fine-tune it so that its outputs score higher on some reward function \(R(x, c)\), where \(x\) is the generated output and \(c\) is the conditioning context (e.g., a text prompt).</p>

<p>The standard approach formulates this as a regularized optimization problem:</p>

\[J(\theta) = \underbrace{\mathbb{E}_{x \sim \pi_\theta, c \sim p(c)}[R(x, c)]}_{\text{maximize reward}} - \underbrace{\beta \cdot D(\pi_\theta, \pi_{\text{ref}})}_{\text{stay close to pre-trained model}}\]

<p>The first term says “produce high-reward outputs.” The second term says “don’t stray too far from the original model.” The coefficient \(\beta\) controls this trade-off, and \(D\) is some divergence measure — KL divergence for language models, Wasserstein-2 (W2) distance for flow matching models.</p>

<p>This is clean and intuitive. But in practice, getting \(\beta\) right is surprisingly hard — and the consequences of getting it wrong are severe.</p>

<h2 id="the-diversity-collapse-problem">The Diversity Collapse Problem</h2>

<h3 id="what-happens-without-regularization">What happens without regularization</h3>

<p>Let’s start with the extreme case: \(\beta = 0\), no regularization at all. The model is free to do whatever maximizes reward.</p>

<p>In online RL, the model generates samples, gets rewards, and updates itself. Without any constraint:</p>

<ol>
  <li>Early in training, the model discovers a few high-reward outputs</li>
  <li>It shifts probability mass toward those outputs</li>
  <li>Now it generates similar outputs more often → they get reinforced further</li>
  <li>This positive feedback loop concentrates all probability mass on a narrow region</li>
</ol>

<p>The equilibrium is a <strong>delta distribution</strong>: the model always generates the same output for a given prompt, regardless of the diversity of valid responses.</p>

\[\pi^*_{\beta=0}(x|c) \to \delta(x - x^*_c), \quad \text{where } x^*_c = \arg\max_x R(x, c)\]

<p>In image generation, this means a given prompt yields essentially the same “optimal” image on every sample: perhaps technically high-scoring, but boring and unusable. In language, the model repeats the same phrasing over and over.</p>

<p>This is <strong>mode collapse</strong> (or <strong>diversity collapse</strong>), and it’s the central failure mode of unconstrained RLHF.</p>
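<p>The feedback loop is easy to reproduce in a toy setting. The sketch below runs exact gradient ascent on \(\mathbb{E}[R] - \beta \cdot D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})\) for a softmax policy over four discrete outputs with a uniform reference. The rewards, step count, and learning rate are arbitrary choices for illustration, and a four-way categorical policy is of course a stand-in for a real generative model.</p>

```python
import math

def softmax(logits: list[float]) -> list[float]:
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs: list[float]) -> float:
    return -sum(p * math.log(p) for p in probs if p > 0)

def train(rewards: list[float], beta: float,
          steps: int = 2000, lr: float = 1.0) -> list[float]:
    """Exact gradient ascent on E[R] - beta * KL(pi || uniform ref)."""
    k = len(rewards)
    ref = [1.0 / k] * k
    logits = [0.0] * k
    for _ in range(steps):
        pi = softmax(logits)
        mean_r = sum(p * r for p, r in zip(pi, rewards))
        log_ratio = [math.log(p / q) for p, q in zip(pi, ref)]
        kl = sum(p * g for p, g in zip(pi, log_ratio))
        # Gradient wrt logit j: pi_j (R_j - E[R]) for the reward term,
        # and pi_j (log_ratio_j - KL) for the KL term.
        logits = [z + lr * p * ((r - mean_r) - beta * (g - kl))
                  for z, p, r, g in zip(logits, pi, rewards, log_ratio)]
    return softmax(logits)

rewards = [1.0, 0.8, 0.5, 0.2]          # hypothetical rewards for four outputs
collapsed = train(rewards, beta=0.0)     # no regularization
regularized = train(rewards, beta=0.5)   # fixed KL regularization
print(entropy(collapsed), entropy(regularized))
```

<p>With \(\beta = 0\) nearly all probability mass ends up on the highest-reward output, even though the second-best output is almost as good. With \(\beta = 0.5\) the policy settles near \(\pi \propto \pi_{\text{ref}} \exp(R/\beta)\) and keeps substantial entropy.</p>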

<div style="display:grid;grid-template-columns:1fr 1fr;gap:16px;margin:1.5em 0">
<figure style="margin:0;text-align:center">
<img src="/images/blog/mnist_collapse.png" style="width:100%;border-radius:8px;border:1px solid #e0e8f0" />
<figcaption style="font-size:0.82em;color:#666;margin-top:6px"><b>Mode collapse (α=0)</b>: model generates identical digits, maximizing reward but destroying diversity. Here α is the regularization coefficient, written as β in this post.</figcaption>
</figure>
<figure style="margin:0;text-align:center">
<img src="/images/blog/mnist_balanced.png" style="width:100%;border-radius:8px;border:1px solid #e0e8f0" />
<figcaption style="font-size:0.82em;color:#666;margin-top:6px"><b>Balanced (α=0.3)</b>: with proper regularization, the model improves quality while preserving digit variety.</figcaption>
</figure>
</div>

<h3 id="why-diversity-matters">Why diversity matters</h3>

<p>One might ask: if the goal is to maximize reward, who cares about diversity? Several reasons:</p>

<ol>
  <li>
    <p><strong>Reward functions are imperfect proxies.</strong> They approximate human preferences, not capture them fully. A collapsed model that exploits quirks in the reward function (reward hacking) produces outputs that score high but look terrible to humans.</p>
  </li>
  <li>
    <p><strong>Generalization.</strong> A diverse model handles novel prompts gracefully. A collapsed model performs well only on prompts similar to its training distribution.</p>
  </li>
  <li>
    <p><strong>Downstream utility.</strong> In creative applications (art, writing, brainstorming), diversity is itself valuable. A model that always gives the same answer is useless for creative tasks.</p>
  </li>
  <li>
    <p><strong>Training stability.</strong> Once the model concentrates on a narrow manifold, gradient signals become noisy and training destabilizes.</p>
  </li>
</ol>

<h3 id="the-standard-fix-fixed-regularization">The standard fix: fixed regularization</h3>

<p>The conventional solution is to set \(\beta &gt; 0\). For flow matching models, we typically use a W2 regularization term:</p>

\[\mathcal{L}_{\text{ORW-CFM-W2}} = \underbrace{\mathcal{L}_{\text{ORW}}(\theta)}_{\text{reward-weighted flow matching}} + \beta \cdot \underbrace{\mathbb{E}_{c,t,x_t}\left[\|\mathbf{v}_\theta(x_t, t, c) - \mathbf{v}_{\text{ref}}(x_t, t, c)\|^2\right]}_{\text{W2 regularization}}\]

<p>For language models with GRPO:</p>

\[\mathcal{L}_{\text{GRPO}} = \mathcal{L}_{\text{PG}}(\theta) + \beta \cdot D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})\]

<p>This works — to a point. But it creates a fundamental dilemma.</p>

<h2 id="the-fixed-β-dilemma">The Fixed-β Dilemma</h2>

<p>Here’s the core problem: <strong>different samples need different levels of regularization.</strong></p>

<p>Consider two samples generated during training for the same prompt:</p>

<table>
  <thead>
    <tr>
      <th style="text-align: left"> </th>
      <th style="text-align: left">Sample A (high reward)</th>
      <th style="text-align: left">Sample B (low reward)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left"><strong>Advantage</strong></td>
      <td style="text-align: left">\(A &gt; 0\) (better than average)</td>
      <td style="text-align: left">\(A &lt; 0\) (worse than average)</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>What we want</strong></td>
      <td style="text-align: left">Exploit: push further in this direction</td>
      <td style="text-align: left">Explore cautiously: pull back toward reference</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>Ideal \(\beta\)</strong></td>
      <td style="text-align: left">Low (give freedom to optimize)</td>
      <td style="text-align: left">High (enforce stability)</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>What fixed \(\beta\) does</strong></td>
      <td style="text-align: left">Same constraint as bad sample</td>
      <td style="text-align: left">Same constraint as good sample</td>
    </tr>
  </tbody>
</table>

<p>Fixed \(\beta\) applies the same regularization pressure to every sample regardless of quality. If \(\beta\) is high enough to prevent collapse on bad samples, it also unnecessarily constrains good samples. If \(\beta\) is low enough to exploit good samples, it fails to stabilize bad samples.</p>

<p>This is not a tuning problem — no single value of \(\beta\) is optimal for all samples simultaneously.</p>

<figure style="margin:1.5em 0">
<img src="/images/blog/adrpo_reward_diversity.png" style="width:100%;border-radius:10px;border:1px solid #e0e8f0" />
<figcaption style="font-size:0.82em;color:#666;margin-top:8px;text-align:center"><b>Fig 1.</b> Reward vs. diversity Pareto front on SD3 text-to-image. Adaptive regularization (ADRPO) achieves a dominant frontier — higher reward at every diversity level compared to fixed-β methods like DPO and ORW-CFM-W2.</figcaption>
</figure>

<h2 id="adaptive-divergence-regularized-policy-optimization-adrpo">Adaptive Divergence Regularized Policy Optimization (ADRPO)</h2>

<h3 id="core-idea">Core idea</h3>

<p>The insight is simple: <strong>make regularization strength a function of sample quality.</strong> High-quality samples get less regularization (exploit); low-quality samples get more (explore safely).</p>

<p>The general ADRPO objective:</p>

\[\boxed{\mathcal{L}_{\text{ADRPO}}(\theta) = \mathcal{L}_{\text{RL}}(\theta) + (\beta_0 - A) \cdot \mathcal{L}_D(\theta)}\]

<p>where \(A\) is the advantage estimate for the current sample, \(\beta_0\) is a baseline coefficient, and \(\mathcal{L}_D\) is the divergence regularization loss.</p>

<p>The effective regularization coefficient becomes:</p>

\[\beta_{\text{eff}} = \beta_0 - A\]

<table>
  <thead>
    <tr>
      <th style="text-align: left">Sample quality</th>
      <th style="text-align: left">Advantage \(A\)</th>
      <th style="text-align: left">Effective \(\beta\)</th>
      <th style="text-align: left">Effect</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left">High reward</td>
      <td style="text-align: left">\(A &gt; 0\)</td>
      <td style="text-align: left">\(\beta_0 - A\) ↓</td>
      <td style="text-align: left">Less regularization → exploit</td>
    </tr>
    <tr>
      <td style="text-align: left">Average</td>
      <td style="text-align: left">\(A \approx 0\)</td>
      <td style="text-align: left">\(\approx \beta_0\)</td>
      <td style="text-align: left">Standard regularization</td>
    </tr>
    <tr>
      <td style="text-align: left">Low reward</td>
      <td style="text-align: left">\(A &lt; 0\)</td>
      <td style="text-align: left">\(\beta_0 - A\) ↑</td>
      <td style="text-align: left">More regularization → stabilize</td>
    </tr>
  </tbody>
</table>

<p>This is a one-line modification to existing RLHF objectives. No new networks, no complex architecture changes.</p>
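<p>To see what the table means numerically, here is a minimal sketch that computes \(\beta_{\text{eff}}\) for a group of samples from one prompt. The rewards and \(\beta_0\) are hypothetical, and the advantage is taken as reward minus the group mean.</p>

```python
def effective_betas(rewards: list[float], beta_0: float) -> list[float]:
    """beta_eff = beta_0 - A, with A = reward minus the group-mean reward."""
    baseline = sum(rewards) / len(rewards)
    return [beta_0 - (r - baseline) for r in rewards]

# Hypothetical rewards for four samples of the same prompt.
rewards = [0.9, 0.6, 0.5, 0.2]
betas = effective_betas(rewards, beta_0=1.0)
print(betas)
```

<p>Because mean-centered advantages sum to zero, the average coefficient over the group stays at \(\beta_0\). The scheme redistributes regularization across samples rather than changing its overall level.</p>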

<h3 id="for-flow-matching-models">For flow matching models</h3>

<p>For flow matching models like SD3, ADRPO combines advantage-weighted flow matching with adaptive Wasserstein-2 (W2) regularization:</p>

\[\mathcal{L}_{\text{ADRPO-FM}}(\theta) = \underbrace{\mathbb{E}\left[A(x_1,c) \cdot \|\mathbf{v}_\theta(x_t, t, c) - \mathbf{u}_t\|^2\right]}_{\text{advantage-weighted flow matching}} + \underbrace{\mathbb{E}\left[(\beta_0 - A(x_1,c)) \cdot \|\mathbf{v}_\theta(x_t, t, c) - \mathbf{v}_{\text{ref}}(x_t, t, c)\|^2\right]}_{\text{adaptive W2 regularization}}\]

<p>Notice that \(A\) appears in <em>both</em> terms. In the first term, the advantage provides <em>bidirectional</em> learning signals:</p>

<ul>
  <li>\(A &gt; 0\): gradient encourages matching the target velocity (strengthen good generation)</li>
  <li>\(A &lt; 0\): gradient <em>reverses</em>, actively pushing the model <em>away</em> from bad generation</li>
</ul>

<p>This is fundamentally different from reward weighting (ORW-CFM-W2), whose non-negative weights can shrink the contribution of bad samples but can never push the model away from them. In high-dimensional spaces like image generation, where bad regions vastly outnumber good ones, this difference is critical.</p>
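<p>A short worked step shows where the sign reversal in the first term comes from. As a simplification, treat the predicted velocity \(\mathbf{v}_\theta\) of a single sample as the free variable (in practice the gradient flows through \(\theta\)), and let \(\eta\) be an assumed small step size. The gradient of the first term and the resulting descent step are:</p>

\[\nabla_{\mathbf{v}_\theta}\left(A\,\|\mathbf{v}_\theta - \mathbf{u}_t\|^2\right) = 2A\,(\mathbf{v}_\theta - \mathbf{u}_t), \qquad \mathbf{v}_\theta \leftarrow \mathbf{v}_\theta - 2\eta A\,(\mathbf{v}_\theta - \mathbf{u}_t)\]

<p>After the step, the residual \(\mathbf{v}_\theta - \mathbf{u}_t\) is scaled by the factor \((1 - 2\eta A)\). For \(A &gt; 0\) and small \(\eta\) this factor is below 1, so the prediction moves toward the target velocity. For \(A &lt; 0\) it is above 1, so the prediction moves away. If \(A\) were replaced by a non-negative weight, the factor could never exceed 1, which is why such a weight can slow learning on a bad sample but cannot repel the model from it.</p>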

<p>The advantage is estimated simply as:</p>

\[A(x_1, c) = R(x_1, c) - V(c), \quad \text{where } V(c) = \frac{1}{|\mathcal{B}|}\sum_{x \in \mathcal{B}} R(x, c)\]

<p>i.e., the reward minus the average reward over the batch \(\mathcal{B}\) of samples generated for the same prompt \(c\).</p>
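<p>The advantage estimate and the two loss terms can be sketched per sample as follows. This is an illustration under stated assumptions, not the paper's implementation: plain Python lists stand in for velocity tensors, the helper names (<code>batch_mean_advantages</code>, <code>squared_distance</code>, <code>adrpo_fm_sample_loss</code>) are hypothetical, and the rewards and \(\beta_0 = 1.0\) are made up.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def batch_mean_advantages(rewards):
    """A_i = R_i - mean(R) over samples generated for the same prompt."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def adrpo_fm_sample_loss(v_theta, u_target, v_ref, advantage, beta0=1.0):
    """Per-sample loss: A * |v - u|^2 + (beta0 - A) * |v - v_ref|^2."""
    fm_term = advantage * squared_distance(v_theta, u_target)
    reg_term = (beta0 - advantage) * squared_distance(v_theta, v_ref)
    return fm_term + reg_term


# Made-up rewards for four images sampled from one prompt.
rewards = [1.0, 0.5, 0.25, 0.25]
advantages = batch_mean_advantages(rewards)
print(advantages)
</code></pre></div></div>

<p>Here the batch mean is 0.5, so the first sample gets a positive advantage, the second sits at zero, and the last two are negative. In a real training loop the per-sample losses would be averaged over the batch and over sampled times \(t\).</p>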

<h3 id="for-language-models">For language models</h3>

<p>For LLMs, ADRPO integrates with GRPO by making the KL coefficient advantage-dependent:</p>

\[\mathcal{L}_{\text{ADRPO-GRPO}}(\theta) = \mathcal{L}_{\text{PG}}(\theta) + (\beta_0 - A_{\text{GRPO}}) \cdot D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})\]

<p>where \(A_{\text{GRPO}}\) is the group-level advantage from GRPO: the reward minus the group mean, divided by the group standard deviation.</p>
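<p>A rough sketch of how the advantage-dependent KL coefficient could be computed for one group of responses is shown here. Several details are assumptions rather than facts from the paper: the population standard deviation, the small <code>eps</code> added for numerical safety, the choice to apply one coefficient per response, and the function names themselves.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import statistics


def grpo_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std (an assumption)
    return [(r - mean) / (std + eps) for r in rewards]


def adaptive_kl_coefficients(advantages, beta0=1.0):
    """One KL coefficient per response: beta0 - A."""
    return [beta0 - a for a in advantages]


# Made-up rewards for a group of four responses to one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = grpo_advantages(rewards)
betas = adaptive_kl_coefficients(advantages, beta0=1.0)
print(advantages, betas)
</code></pre></div></div>

<p>With these rewards the two rewarded responses have an advantage close to +1 and a KL coefficient close to 0, while the other two have an advantage close to -1 and a coefficient close to 2.</p>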

<figure style="margin:1.5em 0">
<img src="/images/blog/adrpo_reward_kl.png" style="width:100%;border-radius:10px;border:1px solid #e0e8f0" />
<figcaption style="font-size:0.82em;color:#666;margin-top:8px;text-align:center"><b>Fig 2.</b> Reward vs. KL divergence on SD3. ADRPO reaches the same reward level at lower KL divergence — more efficient policy improvement.</figcaption>
</figure>

<h2 id="what-emerges-in-practice">What Emerges in Practice</h2>

<h3 id="smaller-models-outperform-larger-ones">Smaller models outperform larger ones</h3>

<p>With ADRPO, a <strong>2B-parameter SD3</strong> model outperforms <strong>FLUX.1-Dev (12B)</strong> and <strong>SANA-1.5 (4.8B)</strong> in attribute binding, semantic consistency, and compositional control. The adaptive regularization extracts more from each parameter by allocating exploration budget where it matters most.</p>

<h3 id="emergent-exploration-in-llms">Emergent exploration in LLMs</h3>

<p>When applied to LLM fine-tuning, ADRPO exhibits an unexpected emergent behavior: <strong>the ability to escape local optima.</strong></p>

<figure style="margin:1.5em 0">
<img src="/images/blog/adrpo_llm_entropy.png" style="width:100%;border-radius:10px;border:1px solid #e0e8f0" />
<figcaption style="font-size:0.82em;color:#666;margin-top:8px;text-align:center"><b>Fig 3.</b> Reward vs. entropy on LLM fine-tuning (Qwen2). ADRPO achieves higher reward while maintaining generation diversity. The adaptive mechanism allows the model to escape local optima that trap fixed-β methods.</figcaption>
</figure>

<p>When the model is stuck in a poor solution (all advantages near zero or negative), the adaptive coefficient \(\beta_0 - A\) increases globally, pulling the model back toward the pre-trained distribution — effectively “resetting” exploration. When it finds a promising direction (high advantages), regularization drops, allowing rapid exploitation. This creates a natural curriculum that no fixed coefficient can replicate.</p>

<h3 id="cross-domain-generality">Cross-domain generality</h3>

<p>ADRPO is not limited to one modality or one divergence measure. It has been validated across:</p>

<ul>
  <li><strong>Flow matching</strong> (SD3) with W2 regularization</li>
  <li><strong>Text LLMs</strong> (Qwen2, Qwen3) with KL divergence</li>
  <li><strong>Audio reasoning LLMs</strong> with GRPO</li>
</ul>

<p>The same principle — advantage-based adaptive regularization — provides consistent improvement regardless of the underlying architecture.</p>

<h2 id="summary">Summary</h2>

<p>The exploration-exploitation dilemma in RLHF is fundamental: <strong>no single regularization coefficient is optimal for all samples.</strong> ADRPO resolves this by making \(\beta\) a function of advantage:</p>

\[\beta_{\text{eff}} = \beta_0 - A\]

<p>One line of math; dramatic practical consequences. For details, see <a href="https://openreview.net/forum?id=aXO0xg0ttW">the paper (NeurIPS 2025)</a>.</p>

<h2 id="references">References</h2>

<p>[1] Fan et al. “Adaptive Divergence Regularized Policy Optimization for Fine-tuning Generative Models.” NeurIPS 2025. <a href="https://openreview.net/forum?id=aXO0xg0ttW">Paper</a></p>

<p>[2] Fan et al. “Online Reward-Weighted Fine-Tuning of Flow Matching with Wasserstein Regularization.” ICLR 2025. <a href="https://openreview.net/forum?id=2IoFFexvuw">Paper</a></p>

<p>[3] Shao et al. “DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.” arXiv:2402.03300, 2024. (GRPO)</p>

<p>[4] Esser et al. “Scaling Rectified Flow Transformers for High-Resolution Image Synthesis.” ICML 2024. (Stable Diffusion 3)</p>

<p>[5] Rafailov et al. “Direct Preference Optimization: Your Language Model is Secretly a Reward Model.” NeurIPS 2023. (DPO)</p>

<hr />

<p><em>Cited as:</em></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Fan, Jiajun. "The Exploration-Exploitation Dilemma in RLHF for Generative Models."
jiajunfan.com (2025).
</code></pre></div></div>]]></content><author><name>Jiajun Fan</name><email>jiajunf3@illinois.edu</email></author><category term="reinforcement learning" /><category term="RLHF" /><category term="generative models" /><category term="flow matching" /><summary type="html"><![CDATA[A deep dive into why fixed regularization in RLHF leads to diversity collapse, and how adaptive sample-level control resolves the exploration-exploitation dilemma.]]></summary></entry></feed>