(Paper Summary) Stabilizing Reinforcement Learning with LLMs: Formulation and Practices

Key Points

  • the token-level loss is a first-order approximation to the sequence-level objective (see the derivation after this list).

  • expert routing differs between training and inference:
    • numerical precision (e.g., BF16 vs. FP8)
    • batch-dependent ops
    • tiny numerical differences can flip the top-k expert choice (see the sketch after this list).
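
A quick sketch of why the first-order claim holds (standard argument; the notation is mine, not necessarily the paper's). The sequence-level importance ratio factorizes over tokens:

$$
\frac{\pi_\theta(y \mid x)}{\pi_{\theta_{\text{old}}}(y \mid x)} = \prod_{t=1}^{T} r_t,
\qquad
r_t = \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\theta_{\text{old}}}(y_t \mid x, y_{<t})}.
$$

Writing $r_t = 1 + \delta_t$ with each $\delta_t$ small,

$$
\prod_{t=1}^{T} (1 + \delta_t) = 1 + \sum_{t=1}^{T} \delta_t + O(\delta^2),
$$

so the token-level sum of per-token ratios agrees with the sequence-level ratio up to first order.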
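
A minimal sketch of the near-tie effect (not the paper's code): the same router logits, viewed through two numeric paths, can select different top-k expert sets. The tensor sizes and the FP8 stand-in noise level are assumptions for illustration.

```python
import torch

# Illustrative only: BF16 rounding vs. a small FP8-like perturbation stand in
# for the training and inference engines' numeric paths.
torch.manual_seed(0)
num_tokens, num_experts, top_k = 1000, 64, 8

logits = torch.randn(num_tokens, num_experts)
logits_train = logits.to(torch.bfloat16).float()               # BF16 rounding
logits_infer = logits + 1e-3 * torch.randn_like(logits)        # FP8-like noise

topk_train = torch.topk(logits_train, top_k, dim=-1).indices
topk_infer = torch.topk(logits_infer, top_k, dim=-1).indices

# Count tokens whose selected expert *set* differs between the two paths.
set_train = torch.zeros(num_tokens, num_experts).scatter_(1, topk_train, 1)
set_infer = torch.zeros(num_tokens, num_experts).scatter_(1, topk_infer, 1)
mismatch = (set_train != set_infer).any(dim=-1).float().mean().item()
print(f"tokens with a different expert set: {mismatch:.1%}")
```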

Vanilla Routing Replay (R2)

  • reuses the expert routing from the old training engine.
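
Both replay variants share one mechanism: cache the top-k expert indices once, then feed them back so the training forward pass routes identically. A minimal sketch assuming a standard top-k gate (class and argument names are hypothetical, not from the paper); for R2 the cached indices come from the old training engine's own forward pass.

```python
import torch
import torch.nn as nn

class ReplayableRouter(nn.Module):
    """Sketch of a top-k router whose expert choice can be replayed from a
    cache. Interface is illustrative; the paper does not prescribe this API."""

    def __init__(self, hidden, num_experts, top_k=2):
        super().__init__()
        self.gate = nn.Linear(hidden, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x, replay_indices=None):
        logits = self.gate(x)  # [tokens, num_experts]
        if replay_indices is None:
            indices = torch.topk(logits, self.top_k, dim=-1).indices
        else:
            indices = replay_indices  # replay cached expert choices
        # Gate weights are recomputed from the current logits so gradients
        # still flow through the gate; only the discrete choice is replayed.
        weights = torch.softmax(logits.gather(-1, indices), dim=-1)
        return indices, weights
```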

Rollout Routing Replay (R3)

  • reuses the expert routing from the old inference (rollout) engine.
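
Continuing the hypothetical sketch above, the only difference between R2 and R3 is which engine's cached routing is replayed: R3 feeds back the indices the inference engine recorded while generating the rollouts.

```python
import torch

torch.manual_seed(0)
router = ReplayableRouter(hidden=16, num_experts=8, top_k=2)  # class from the R2 sketch
x = torch.randn(4, 16)

# R3: the rollout (inference) engine records its routing during generation;
# here one ordinary forward pass stands in for that engine.
rollout_indices, _ = router(x)

# The training forward pass replays those choices, so the experts used for
# gradient computation match the ones that produced the samples.
# (For R2, `rollout_indices` would instead come from the old training engine.)
indices, gate_weights = router(x, replay_indices=rollout_indices)
assert torch.equal(indices, rollout_indices)
```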