(논문 요약) DOES RLHF SCALE? EXPLORING THE IMPACTS FROM DATA, MODEL, AND METHOD (Paper)

핵심 내용

  • Increasing data diversity and volume improves reward model performance.
  • More response samples per prompt boost performance initially but quickly plateau.
  • Larger reward models offer modest gains in policy training.
  • Larger policy models benefit less from RLHF with a fixed reward model.

  • task 에 따라 성능 추이가 다름.