(Paper summary) SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

Key points

  • Experiments are run on two benchmarks:
    • GeneralPoints: an arithmetic reasoning card game (combine the dealt card values with arithmetic operations to reach a target number; see the checker sketch after this list)

    • V-IRL: real-world navigation environment

  • RL generalizes better than SFT out-of-distribution.
    • RL (PPO), especially when trained with an outcome-based reward, generalizes to unseen variants in both the rule-based textual and visual environments (see the reward sketch below).
    • SFT, in contrast, tends to memorize the training data and struggles to generalize out-of-distribution in either scenario.
  • RL is only effective when applied on top of a model that has first been SFT-trained to follow instructions; the SFT stage stabilizes the model's output format so that RL training can take hold.
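
Below is a minimal sketch of what a GeneralPoints-style outcome check could look like, assuming the common setup where four card values must be combined with +, -, *, / to hit a target of 24; the function names and the face-card rule toggle are illustrative, not the paper's code.

```python
# Minimal sketch of a GeneralPoints-style outcome check (illustrative, not the paper's code).
# Assumed setup: the model emits an arithmetic expression over the four dealt card
# values, and the episode succeeds if the expression equals the target (24 here).

import ast
import operator

def card_value(card: str, face_as_ten: bool = True) -> int:
    # The face-card rule (J/Q/K = 10 vs. 11/12/13) is the kind of rule variant
    # used to build an out-of-distribution split.
    faces = {"A": 1, "J": 11, "Q": 12, "K": 13}
    if card in faces:
        return 10 if face_as_ten and card != "A" else faces[card]
    return int(card)

ALLOWED_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
               ast.Mult: operator.mul, ast.Div: operator.truediv}

def safe_eval(expr: str) -> float:
    """Evaluate an expression containing only numbers, parentheses, and + - * /."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in ALLOWED_OPS:
            return ALLOWED_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("disallowed expression")
    return _eval(ast.parse(expr, mode="eval"))

def is_correct(expr: str, cards: list[str], target: float = 24,
               face_as_ten: bool = True) -> bool:
    """Outcome check: the expression uses exactly the dealt values and hits the target."""
    dealt = sorted(card_value(c, face_as_ten) for c in cards)
    tree = ast.parse(expr, mode="eval")
    used = sorted(n.value for n in ast.walk(tree) if isinstance(n, ast.Constant))
    return used == dealt and abs(safe_eval(expr) - target) < 1e-6

print(is_correct("(5 - 1) * (2 + 4)", ["5", "A", "2", "4"]))  # True
```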
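
And a sketch of the outcome-based reward mentioned above: a sparse, episode-level scalar that only looks at whether the final answer is correct, rather than scoring tokens against a reference completion. It reuses the hypothetical is_correct checker from the previous sketch; the +1/-1 values are illustrative, not the paper's.

```python
# Outcome-based reward sketch: the episode-level signal PPO would optimize against.
# Reuses is_correct() from the sketch above; the reward values are illustrative.

def outcome_reward(model_output: str, cards: list[str], face_as_ten: bool = True) -> float:
    """Score one finished episode by its final answer only."""
    try:
        return 1.0 if is_correct(model_output, cards, face_as_ten=face_as_ten) else -1.0
    except (ValueError, SyntaxError, ZeroDivisionError):
        return -1.0  # unparseable or illegal output is penalized like a wrong answer
```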