(Paper summary) Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

Key points

DeepSeek-Coder-V2-Instruct vs. closed models
- Moatless Tools: a GitHub repo that can run SWE-bench
- CodeStory Aide: a commercial agent (GPT-4o + Claude 3.5 Sonnet)
Experimental results across various models
Model-based verification turns out to be largely ineffective
- Majority vote: pick the most common final answer
- Reward Model + Best-of-N: score each output with the ArmoRM-Llama3-8B-v0.1 reward model (SOTA reasoning score on RewardBench), then pick the highest-scoring one
- Reward Model + Majority Vote: score each output with the ArmoRM-Llama3-8B-v0.1 reward model, then take a majority vote in which each sample is weighted in proportion to its score
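The three selection strategies above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the function names, the example answers, and the scores are all hypothetical, and in the actual setup the scores would come from the reward model. The third function assumes the reward-weighted reading of "Reward Model + Majority Vote", where each sample's vote counts in proportion to its score.

```python
from collections import Counter, defaultdict


def majority_vote(answers):
    """Return the most common final answer among the samples."""
    return Counter(answers).most_common(1)[0][0]


def best_of_n(answers, scores):
    """Return the answer of the single highest-scoring sample."""
    best_index = max(range(len(answers)), key=lambda i: scores[i])
    return answers[best_index]


def weighted_majority_vote(answers, scores):
    """Sum the scores per distinct answer and return the answer with the largest total."""
    totals = defaultdict(float)
    for answer, score in zip(answers, scores):
        totals[answer] += score
    return max(totals, key=totals.get)


# Hypothetical final answers from six samples, with made-up reward scores.
answers = ["42", "41", "42", "7", "41", "41"]
scores = [0.9, 0.2, 0.8, 0.95, 0.1, 0.3]

print(majority_vote(answers))                   # most frequent answer
print(best_of_n(answers, scores))               # answer of the top-scored sample
print(weighted_majority_vote(answers, scores))  # answer with the largest score total
```

On this toy input the three strategies each pick a different answer, which shows that the choice of verifier matters even when the underlying samples are identical.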
Failure-case analysis: the individual Chain of Thought steps are often correct
For some problems the probability of producing the correct answer is extremely low (only a few correct outputs among 10,000 samples)
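To see why even such rare successes matter for repeated sampling, here is a small sketch of the standard "at least one success" calculation. It assumes every sample is independent and has the same success probability `p`. The value 3 in 10,000 is a hypothetical stand-in for "a few in 10,000", not a number from the paper, and `coverage` is an illustrative helper name.

```python
def coverage(p, k):
    """Probability that at least one of k independent samples is correct."""
    return 1.0 - (1.0 - p) ** k


# Hypothetical per-sample success rate: 3 correct outputs out of 10,000.
p = 3 / 10_000

for k in (1, 100, 10_000):
    print(f"k={k:>6}: coverage ~ {coverage(p, k):.4f}")
```

Under these assumptions, a problem that is almost never solved by a single sample is solved by at least one of 10,000 samples roughly 95% of the time. This only helps in practice if a verifier can pick out the correct sample.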