(Paper Summary) SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
Key Points
- Experiments on two benchmarks:
  - GeneralPoints: an arithmetic reasoning card game
  - V-IRL: a real-world navigation environment
- RL generalizes better than SFT out-of-distribution: it transfers to unseen variants of each task (e.g., changed arithmetic rules in GeneralPoints, unseen environments in V-IRL).
- RL (PPO), especially when trained with an outcome-based reward, generalizes in both the rule-based textual and visual environments (see the reward sketch after this list).
- SFT, in contrast, tends to memorize the training data and struggles to generalize out-of-distribution in either scenario.
- SFT is still a prerequisite: RL is only effective when run on top of a model that already follows instructions after SFT, since SFT stabilizes the output format that RL training relies on.
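
To make the outcome-based reward concrete, below is a minimal sketch of a binary verifier for a GeneralPoints-style episode: the model's proposed equation earns reward 1 only if it uses each of the four card values exactly once and evaluates to the target number (24 here). This is an illustrative reconstruction, not the paper's code; the function names (`outcome_reward`, `eval_expr`), the default target of 24, and the safe-evaluation approach via Python's `ast` module are all my assumptions.

```python
# Sketch of a binary, outcome-based reward for a GeneralPoints-style task:
# the policy must combine the four card values with +, -, *, / into an
# expression that evaluates to the target. Names and exact rules are
# assumptions, not the paper's implementation.
import ast
import operator
from collections import Counter

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def eval_expr(node):
    """Safely evaluate an arithmetic AST (numeric literals and + - * / only)."""
    if isinstance(node, ast.Expression):
        return eval_expr(node.body)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        return OPS[type(node.op)](eval_expr(node.left), eval_expr(node.right))
    raise ValueError("disallowed expression")

def used_numbers(node):
    """Collect the multiset of numeric literals appearing in the expression."""
    return Counter(n.value for n in ast.walk(node) if isinstance(n, ast.Constant))

def outcome_reward(expr: str, cards: list[int], target: int = 24) -> int:
    """1 if expr uses each card value exactly once and equals target, else 0."""
    try:
        tree = ast.parse(expr, mode="eval")
        ok_cards = used_numbers(tree) == Counter(cards)
        ok_value = abs(eval_expr(tree) - target) < 1e-6
        return int(ok_cards and ok_value)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return 0  # malformed or illegal model outputs get zero reward

# Example: cards [5, 5, 4, 1] -> "(5 - 1) * 5 + 4" = 24
print(outcome_reward("(5 - 1) * 5 + 4", [5, 5, 4, 1]))  # 1
print(outcome_reward("5 * 5 - 1", [5, 5, 4, 1]))        # 0 (card 4 unused)
```

The key property is that only the final outcome is scored, not any particular reasoning trace, which is the reward setting under which the paper reports RL's strongest generalization.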