(Paper Summary) META-REWARDING LANGUAGE MODELS: Self-Improving Alignment with LLM-as-a-Meta-Judge (Paper)

Key Ideas

  • The same LLM plays all three roles of generation (actor), evaluation (judge), and meta-evaluation (meta-judge) to collect training data (see the sketch after this list)
    • actor preference pairs: preference data over the actor's responses, ranked by the judge's scores
    • judge preference pairs: preference data over the judge's judgments, ranked by the meta-judge
  • Prompts
    • judge prompt: scores a single response against a rubric
    • meta-judge prompt: compares two judgments of the same response and picks the better one
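
A minimal sketch of the single-model, three-role data collection, assuming the model is exposed as a plain prompt-to-text callable (`llm`). The prompt templates and the `parse_score` helper are hypothetical paraphrases rather than the paper's exact prompts, and the paper's length-control selection heuristic is omitted here for brevity.

```python
import itertools
import re
from typing import Callable

# Paraphrase of the additive judge rubric (score a single response).
JUDGE_TEMPLATE = (
    "Review the user's question and the response, then assign a score out of 5.\n"
    "User: {instruction}\nResponse: {response}\nScore:"
)

# Paraphrase of the pairwise meta-judge prompt (compare two judgments).
META_JUDGE_TEMPLATE = (
    "Two judgments of the same response are shown. Decide which judgment is better.\n"
    "User: {instruction}\nResponse: {response}\n"
    "Judgment A: {a}\nJudgment B: {b}\nVerdict (A or B):"
)

def parse_score(judgment: str) -> float:
    """Hypothetical helper: pull the first number out of the judge's free-form text."""
    m = re.search(r"\d+(\.\d+)?", judgment)
    return float(m.group()) if m else 0.0

def collect_pairs(llm: Callable[[str], str], instruction: str,
                  n_responses: int = 4, n_judgments: int = 3):
    # Actor role: sample several candidate responses for the same instruction.
    responses = [llm(instruction) for _ in range(n_responses)]

    # Judge role: score each response several times and average the scores.
    judgments = [[llm(JUDGE_TEMPLATE.format(instruction=instruction, response=r))
                  for _ in range(n_judgments)] for r in responses]
    scores = [sum(map(parse_score, js)) / n_judgments for js in judgments]

    # Actor preference pair: highest- vs. lowest-scored response.
    best = max(range(n_responses), key=scores.__getitem__)
    worst = min(range(n_responses), key=scores.__getitem__)
    actor_pair = (instruction, responses[best], responses[worst])

    # Meta-judge role: pairwise comparisons between judgments of one response,
    # run in both orders to average out the meta-judge's position preference.
    wins = [0] * n_judgments
    for a, b in itertools.permutations(range(n_judgments), 2):
        verdict = llm(META_JUDGE_TEMPLATE.format(
            instruction=instruction, response=responses[best],
            a=judgments[best][a], b=judgments[best][b]))
        wins[a if verdict.strip().startswith("A") else b] += 1

    # Judge preference pair: meta-judge-preferred judgment vs. the rejected one.
    chosen = max(range(n_judgments), key=wins.__getitem__)
    rejected = min(range(n_judgments), key=wins.__getitem__)
    judge_pair = (JUDGE_TEMPLATE.format(instruction=instruction, response=responses[best]),
                  judgments[best][chosen], judgments[best][rejected])

    return actor_pair, judge_pair
```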

Experimental Results

  • Heuristics are used to handle length control and the meta-judge's position preference, and the model is trained for 4 iterations (a schedule sketch follows this list)
    • Iter 1 (M1): DPO (initialized from the SFT model) on both actor and judge preference pairs generated by the SFT model
    • Iter 2 (M2): DPO (from M1) on actor and judge preference pairs generated by M1
    • Iter 3 (M3): DPO (from M2) exclusively on actor preference pairs generated by M2
    • Iter 4 (M4): DPO (from M3) exclusively on actor preference pairs generated by M3
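
A sketch of the four-iteration schedule above, assuming hypothetical `dpo_train` and `collect_pairs` callables (the latter as in the earlier sketch, treating the current model as a prompt-to-text callable) and a fresh prompt subset per iteration; it illustrates the schedule only, not the paper's training code.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Pair = Tuple[str, str, str]  # (prompt, chosen, rejected)

def meta_rewarding_loop(
    sft_model,
    prompts_by_iter: Dict[int, Sequence[str]],  # a fresh prompt subset per iteration (assumed)
    collect_pairs: Callable,                     # e.g. the data-collection sketch above
    dpo_train: Callable,                         # hypothetical DPO fine-tuning step
    n_iters: int = 4,
):
    model = sft_model  # the SFT seed model
    for it in range(1, n_iters + 1):
        actor_pairs: List[Pair] = []
        judge_pairs: List[Pair] = []
        for instruction in prompts_by_iter[it]:
            a, j = collect_pairs(model, instruction)
            actor_pairs.append(a)
            judge_pairs.append(j)
        if it <= 2:
            # Iter 1-2 (M1, M2): DPO on both actor and judge preference pairs.
            model = dpo_train(model, actor_pairs + judge_pairs)
        else:
            # Iter 3-4 (M3, M4): DPO on actor preference pairs only.
            model = dpo_train(model, actor_pairs)
    return model  # M4
```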