(논문 요약) One Token to Fool LLM-as-a-Judge (paper)
핵심 내용
- false posivie rewards
- non-word symbols (e.g.,”:” or “.”)
- reasoning openers
- “Thought process:”
- “Let’s solve this problem step by step.”
- Qwen2.5-7B-Instruct 모델에 SFT -> Master-RM
- 학습데이터: 160k public data + FP 방지 목적 aug-data
- FP 방지 목적 aug-data: GPT-4o-mini 답변의 첫번째 문장만 가져와서 ‘No’ 로 레이블 (20k)
- 예시: “To solve the problem, we need to find the mode, median, and average of the donation amounts from the students.”