(Paper summary) EMERGENT HIERARCHICAL REASONING IN LLMS THROUGH REINFORCEMENT LEARNING (Paper)

Key points

  • During RL training, low-level execution tokens stabilize first.
    • arithmetic calculations
    • variable substitutions
    • direct application of known formulas
  • High-level strategic tokens keep being learned throughout training.
    • deduction (e.g., we can use the fact that)
    • branching (e.g., let’s try a different approach)
    • backtracking (e.g., but the problem mentions that)

  • HICRA reward: with $\alpha=0.2$, the weight on strategic tokens is raised when the advantage is positive and lowered when it is negative.
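
The HICRA bullet can be illustrated with a minimal sketch. It assumes the adjustment takes the form $\hat{A} + \alpha\,|\hat{A}|$ on strategic tokens and leaves all other tokens unchanged; that form is an assumption consistent with the bullet, not something these notes confirm. The function name `hicra_advantages` and the `is_strategic` mask are hypothetical, and how tokens get tagged as strategic is outside the scope of the sketch.

```python
from typing import List


def hicra_advantages(
    advantages: List[float],
    is_strategic: List[bool],
    alpha: float = 0.2,
) -> List[float]:
    """Reweight per-token advantages in favor of strategic tokens.

    Assumed form (not confirmed by the notes): A + alpha * |A| on strategic
    tokens, A unchanged elsewhere. A positive advantage is scaled by
    (1 + alpha); a negative one is scaled by (1 - alpha).
    """
    if len(advantages) != len(is_strategic):
        raise ValueError("advantages and is_strategic must have the same length")

    adjusted = []
    for adv, strategic in zip(advantages, is_strategic):
        if strategic:
            # Strategic token: shift the advantage upward by alpha * |A|.
            adjusted.append(adv + alpha * abs(adv))
        else:
            # Execution token: keep the original advantage.
            adjusted.append(adv)
    return adjusted


if __name__ == "__main__":
    # Two strategic tokens (positive and negative advantage) and two execution tokens.
    print(hicra_advantages([1.0, 1.0, -1.0, -1.0], [True, False, True, False]))
```

Under this assumed form, a strategic token in a successful rollout receives more credit than an execution token with the same advantage, while a strategic token in a failed rollout is penalized less.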