
The Many Faces of On-Policy Distillation: Pitfalls, Mechanisms, and Fixes

On-policy distillation looks simple: let the student roll out, let the teacher score each token. Yet published results are wildly inconsistent. We map where it works, where it collapses, and what to do about each failure mode.

Summary

A unified look at OPD and OPSD across reasoning, prompting, and alignment.

On-policy distillation (OPD) trains a student on its own rollouts while a teacher provides dense token-level supervision. Reported results have been inconsistent — some papers see big wins, others see collapse. We run side-by-side experiments to map where each variant succeeds and where it fails, isolate the failure modes, and offer concrete fixes that restore stability.

Overview figure showing the OPD/OPSD design space, three failure modes, and proposed fixes.
The OP(S)D design space, observed task-dependent behaviors, three failure mechanisms, and the corresponding fixes.

Background

OPD vs. OPSD in one breath.

Both methods sample trajectories from the student and ask a teacher to score them token-by-token; they differ only in where the teacher comes from. In OPD, the teacher is a separate, typically larger model, and privileged information (PI) is optional — the promise is a cheap way to transfer capability into a smaller student. In OPSD, the teacher is the student itself, conditioned on extra information the student does not see at test time: a ground-truth answer, a system prompt, or a preference rule. The same loss; very different inductive biases.

Diagram of OPD and OPSD setups.
In OPSD, the teacher is built from the student plus PI. In OPD, the teacher is an external stronger model and PI is optional.
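
Both setups reduce to the same per-token objective. Here is a minimal sketch of that shared loss, assuming PyTorch; the only thing that changes between OPD and OPSD is where teacher_logits come from (an external model vs. the student re-run with PI in context).

import torch.nn.functional as F

def on_policy_distill_loss(student_logits, teacher_logits):
    # Both tensors are [seq_len, vocab] logits scored on the SAME
    # student-sampled rollout. In OPD, teacher_logits come from an
    # external model; in OPSD, from the student conditioned on PI.
    log_p_s = F.log_softmax(student_logits, dim=-1)
    log_p_t = F.log_softmax(teacher_logits.detach(), dim=-1)
    # per-position reverse KL: KL(pi_S || pi_T), averaged over the rollout
    rkl = (log_p_s.exp() * (log_p_s - log_p_t)).sum(dim=-1)
    return rkl.mean()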

Results

Where each method lands.

Across the regimes we study, OPSD's success tracks one feature of the privileged information: whether it encodes a shared latent rule across examples, or an instance-specific answer. OPD on math reasoning is sensitive to teacher choice and loss formulation; OPSD on math reasoning fails for a deeper structural reason.

Fails

OPSD on math reasoning

Instance-specific PI differs from problem to problem, pushing the student toward a marginal, PI-free policy that no individual teacher endorses.

Fragile

OPD on math reasoning

Initial gains, then collapse: rollouts grow long, fill with hedging tokens, and accuracy crashes to near zero.

Works

System-prompt internalization

A fixed prompt acts as a shared rule. OPSD compresses prompted behavior into the model with no accuracy loss.

Works

Style alignment

On CharacterBench and EmotionBench, OPSD converges faster than GRPO and PPO at matched sampling budgets.

A prefix-conditioning probe (GPQA-Diamond, Qwen3-14B teacher). When the teacher continues from its own reasoning, it is far more accurate than when it is forced to continue from a student-written prefix.
  • Teacher standalone: 62.1%
  • Teacher on a student prefix: 46.0%
  • Drop from prefix conditioning: −16.1pt
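
The probe itself is easy to reproduce. A sketch under stated assumptions: generate and grade are hypothetical stand-ins for a model's sampling call and an answer verifier, and the student's draft is cut at a fixed fraction.

from statistics import mean

def prefix_probe(teacher, student, questions, prefix_frac=0.5):
    # `.generate` and `grade` are assumed helpers (sampling + verifier).
    # Condition A: teacher answers from scratch.
    # Condition B: teacher must continue a student-written partial solution.
    standalone, on_prefix = [], []
    for q in questions:
        standalone.append(grade(q, teacher.generate(q)))       # condition A

        draft = student.generate(q)
        prefix = draft[: int(len(draft) * prefix_frac)]
        forced = teacher.generate(q + prefix)                  # condition B
        on_prefix.append(grade(q, prefix + forced))
    return mean(standalone), mean(on_prefix)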

Why it breaks

Three mechanisms behind the failures.

Mechanism 01

Student prefixes distort the teacher

Teacher tokens are scored on partial trajectories the student wrote. Conditioned on a committed-to branch the teacher would never have chosen, it tends to emit revision tokens — "wait", "but" — that pull the trajectory sideways instead of forward, producing local semantic conflict.

Mechanism 02

Top-K reverse-KL has a biased gradient

Truncating reverse-KL to the top-K vocabulary keeps memory tractable but breaks a cancellation that holds over the full vocabulary. A spurious +1 term survives, so a token's probability is pushed up only when the teacher assigns it more than e times the student's mass; smaller margins are silently suppressed.

# surviving bias term in the gradient of the unnormalized Top-K reverse KL
∇L_topK-RKL ∝ Σ_{v ∈ Top-K} π_S(v) · [ log(π_S(v) / π_T(v)) + 1 ] · ∇ log π_S(v)
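
The claim is easy to verify numerically. A toy check, assuming PyTorch: truncate to the teacher's top-K (which set is truncated does not change the conclusion), take the autograd gradient of the unnormalized loss, and confirm it matches the biased analytic form above, +1 term included.

import torch

torch.manual_seed(0)
V, K = 10, 4
theta = torch.randn(V, requires_grad=True)          # student logits
p_t = torch.softmax(torch.randn(V), dim=-1)         # fixed teacher probs
topk = torch.topk(p_t, K).indices                   # teacher's top-K tokens

p_s = torch.softmax(theta, dim=-1)
# unnormalized Top-K reverse KL
loss = (p_s[topk] * (p_s[topk].log() - p_t[topk].log())).sum()
g_auto, = torch.autograd.grad(loss, theta)

# analytic form: sum_{v in K} pi_S(v) [log(pi_S(v)/pi_T(v)) + 1] grad log pi_S(v)
with torch.no_grad():
    coef = p_s[topk] * ((p_s[topk] / p_t[topk]).log() + 1.0)
    g_analytic = torch.zeros(V)
    for c, v in zip(coef, topk):
        one_hot = torch.zeros(V)
        one_hot[v] = 1.0
        g_analytic += c * (one_hot - p_s)           # grad of log-softmax

print(torch.allclose(g_auto, g_analytic, atol=1e-6))  # True
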
Mechanism 03

OPSD only learns the PI-free margin

The student never sees PI, so the optimum it is pushed toward is the geometric mean across PI-conditioned teachers. When PI varies per problem, those teachers prescribe incompatible behaviors, and the consensus is weaker than any of them.
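
A toy demonstration of that optimum, again assuming PyTorch: minimize the summed reverse KL of one PI-free student against two disagreeing "PI-conditioned teachers" and compare the result to the normalized geometric mean of their distributions.

import torch

torch.manual_seed(0)
V = 6
# two PI-conditioned teachers that disagree (instance-specific PI)
p1 = torch.softmax(3 * torch.randn(V), dim=-1)
p2 = torch.softmax(3 * torch.randn(V), dim=-1)

theta = torch.zeros(V, requires_grad=True)          # PI-free student
opt = torch.optim.Adam([theta], lr=0.05)
for _ in range(3000):
    q = torch.softmax(theta, dim=-1)
    loss = ((q * (q.log() - p1.log())).sum()
            + (q * (q.log() - p2.log())).sum())
    opt.zero_grad(); loss.backward(); opt.step()

geo = (p1 * p2).sqrt()
geo = geo / geo.sum()                               # normalized geometric mean
print(torch.allclose(torch.softmax(theta, dim=-1), geo, atol=1e-3))  # True

The geometric mean keeps mass only where every teacher tolerates it, which is exactly why the consensus policy is weaker than any single PI-conditioned teacher.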

Illustration of local semantic conflict in OPD.
When the teacher's preferred path diverges from the student's prefix, supervision often pushes the student to switch branches mid-thought rather than refine the current one.

A look at collapse: hedging tokens take over.

Under unnormalized Top-20 reverse KL, training is initially fine and then deteriorates around step 700: rollouts lengthen, fillers spike, and by step 1000 the model is near-deterministically producing "maybe".

Word cloud at training step 0
Step 0 — well-formed reasoning vocabulary.
Word cloud at training step 700
Step 700 — verbosity rises, "wait" / "but" inflate.
Word cloud at training step 1000
Step 1000 — degenerate "maybe" loop dominates.

Why PI structure is the deciding factor.

OPSD effectiveness depends on PI structure.
Instance-specific PI (e.g., a problem's gold answer) drives the PI-conditioned teachers in incompatible directions per example. Shared-rule PI (e.g., a fixed system prompt) gives the student a single inductive bias to absorb.

What to do

Three fixes that recover stability.

Fix 01

Stop-gradient Top-K KL

Stop gradients on the student log-prob inside the loss, leaving only the teacher–student log-ratio as an advantage-like weight. The biased +1 term disappears and training stays stable. Renormalizing within the Top-K set is a comparable alternative.
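
A sketch of the stop-gradient variant, assuming PyTorch; tensor names and shapes are illustrative (e.g., a teacher that returns only its top-K log-probs and token ids per position).

import torch.nn.functional as F

def topk_rkl_stopgrad(student_logits, teacher_topk_logprobs, topk_ids):
    # student_logits:        [T, V] logits on the student's own rollout
    # teacher_topk_logprobs: [T, K] teacher log-probs at its top-K tokens
    # topk_ids:              [T, K] corresponding vocabulary ids
    log_p_s = F.log_softmax(student_logits, dim=-1).gather(-1, topk_ids)
    # detach the log-ratio: it becomes an advantage-like weight, and the
    # gradient is sum_v pi_S(v) * ratio(v) * grad log pi_S(v) -- no '+1'
    ratio = (log_p_s - teacher_topk_logprobs).detach()
    return (log_p_s.exp() * ratio).sum(dim=-1).mean()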

Fix 02

RLVR-adapt the teacher first

Run RL with verifiable rewards on the teacher before distilling. A 1.7B model lifted by GRPO is a better OPD teacher than an out-of-the-box 8B — closer in distribution to the student even at similar accuracy, so its token-level signals fit the student's prefixes.

Fix 03

SFT-warm the student

When the student emits malformed or off-language tokens, the teacher cannot give meaningful feedback. A short SFT pass on teacher-generated traces reins in the output space, stabilizes response length, and lets the subsequent OPD phase improve accuracy instead of collapsing.

The recipe we end up recommending. SFT initializes a well-formed student → RLVR adapts the teacher onto the training distribution → on-policy distillation transfers the adapted teacher's behavior back into the student with a stop-gradient Top-K loss. Each step removes one of the failure mechanisms above.
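
In code, the recipe is a three-stage pipeline. All function names below are illustrative placeholders, not a real library API.

# 1. SFT warm-start: rein in the student's output space.
student = sft(student, data=sample_traces(teacher, prompts))

# 2. RLVR: adapt the teacher onto the training distribution (e.g., via GRPO).
teacher = rlvr(teacher, prompts, reward=verifier)

# 3. On-policy distillation with the debiased loss.
student = distill(student, teacher, prompts, loss=topk_rkl_stopgrad)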

Takeaways

Design checklist for on-policy distillation.

  • Match PI to task structure. Use OPSD when the privileged information encodes a shared rule (system prompt, alignment preference). Avoid it when PI is per-instance, like a problem's gold answer.
  • Pick teachers by distribution closeness, not benchmark score. A weaker but on-distribution teacher often distills better than a stronger but distant one.
  • Treat Top-K reverse-KL with care. The unnormalized form has a hidden bias; use stop-gradient or renormalized variants, or move the signal into a policy-gradient form.
  • Stabilize the student before distilling. A short SFT pass keeps on-policy samples in regions where teacher feedback is informative.
  • Watch for collapse signatures. Rising "wait" / "maybe" / "but" frequency and a falling teacher–student overlap ratio precede full degeneration; intervene early (a minimal monitoring sketch follows below).
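
A minimal monitoring sketch for those collapse signatures; the hedging-token list and the top-K overlap definition are our own illustrative choices.

from collections import Counter

HEDGES = {"wait", "maybe", "but", "hmm"}   # illustrative token list

def collapse_signals(rollout_tokens, student_topk_ids, teacher_topk_ids):
    # hedge_rate: fraction of rollout tokens that are hedging words
    counts = Counter(t.lower() for t in rollout_tokens)
    hedge_rate = sum(counts[w] for w in HEDGES) / max(len(rollout_tokens), 1)
    # topk_overlap: mean per-position overlap of student/teacher top-K sets
    overlap = sum(len(set(s) & set(t)) / max(len(t), 1)
                  for s, t in zip(student_topk_ids, teacher_topk_ids))
    topk_overlap = overlap / max(len(student_topk_ids), 1)
    return {"hedge_rate": hedge_rate, "topk_overlap": topk_overlap}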