(Paper Summary) Stabilizing Reinforcement Learning with LLMs: Formulation and Practices

Key Points

  • the token-level loss is a first-order approximation to the sequence-level objective (see the derivation after this list).

  • expert routing differs between training and inference:
    • numerical precision (e.g., BF16 vs. FP8)
    • batch-dependent ops
    • tiny numerical differences can flip the top-k expert choice (see the sketch after this list).
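
A quick sketch of why the first-order claim holds (standard argument; the notation is mine, not necessarily the paper's). The sequence-level importance ratio factorizes over tokens:

$$
\frac{\pi_\theta(y \mid x)}{\pi_{\theta_{\text{old}}}(y \mid x)} = \prod_{t=1}^{T} r_t,
\qquad
r_t = \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\theta_{\text{old}}}(y_t \mid x, y_{<t})}.
$$

Writing $r_t = 1 + \delta_t$ with each $\delta_t$ small,

$$
\prod_{t=1}^{T} (1 + \delta_t) = 1 + \sum_{t=1}^{T} \delta_t + O(\delta^2),
$$

so the token-level sum of per-token ratios agrees with the sequence-level ratio up to first order.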
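
A minimal sketch of the near-tie effect (not the paper's code): the same router logits, viewed through two numeric paths, can select different top-k expert sets. The tensor sizes and the FP8 stand-in noise level are assumptions for illustration.

```python
import torch

# Illustrative only: BF16 rounding vs. a small FP8-like perturbation stand in
# for the training and inference engines' numeric paths.
torch.manual_seed(0)
num_tokens, num_experts, top_k = 1000, 64, 8

logits = torch.randn(num_tokens, num_experts)
logits_train = logits.to(torch.bfloat16).float()               # BF16 rounding
logits_infer = logits + 1e-3 * torch.randn_like(logits)        # FP8-like noise

topk_train = torch.topk(logits_train, top_k, dim=-1).indices
topk_infer = torch.topk(logits_infer, top_k, dim=-1).indices

# Count tokens whose selected expert *set* differs between the two paths.
set_train = torch.zeros(num_tokens, num_experts).scatter_(1, topk_train, 1)
set_infer = torch.zeros(num_tokens, num_experts).scatter_(1, topk_infer, 1)
mismatch = (set_train != set_infer).any(dim=-1).float().mean().item()
print(f"tokens with a different expert set: {mismatch:.1%}")
```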

Vanilla Routing Replay (R2)

  • reuses the expert routing from the old training engine.
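
Both replay variants share one mechanism: cache the top-k expert indices once, then feed them back so the training forward pass routes identically. A minimal sketch assuming a standard top-k gate (class and argument names are hypothetical, not from the paper); for R2 the cached indices come from the old training engine's own forward pass.

```python
import torch
import torch.nn as nn

class ReplayableRouter(nn.Module):
    """Sketch of a top-k router whose expert choice can be replayed from a
    cache. Interface is illustrative; the paper does not prescribe this API."""

    def __init__(self, hidden, num_experts, top_k=2):
        super().__init__()
        self.gate = nn.Linear(hidden, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x, replay_indices=None):
        logits = self.gate(x)  # [tokens, num_experts]
        if replay_indices is None:
            indices = torch.topk(logits, self.top_k, dim=-1).indices
        else:
            indices = replay_indices  # replay cached expert choices
        # Gate weights are recomputed from the current logits so gradients
        # still flow through the gate; only the discrete choice is replayed.
        weights = torch.softmax(logits.gather(-1, indices), dim=-1)
        return indices, weights
```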

Rollout Routing Replay (R3)

  • reuses the expert routing from the old inference (rollout) engine.
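
Continuing the hypothetical sketch above, the only difference between R2 and R3 is which engine's cached routing is replayed: R3 feeds back the indices the inference engine recorded while generating the rollouts.

```python
import torch

torch.manual_seed(0)
router = ReplayableRouter(hidden=16, num_experts=8, top_k=2)  # class from the R2 sketch
x = torch.randn(4, 16)

# R3: the rollout (inference) engine records its routing during generation;
# here one ordinary forward pass stands in for that engine.
rollout_indices, _ = router(x)

# The training forward pass replays those choices, so the experts used for
# gradient computation match the ones that produced the samples.
# (For R2, `rollout_indices` would instead come from the old training engine.)
indices, gate_weights = router(x, replay_indices=rollout_indices)
assert torch.equal(indices, rollout_indices)
```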