(Paper summary) QWEN2 TECHNICAL REPORT (paper)

Key points

  • Tokenizer: the same byte-level byte-pair encoding (BPE) tokenizer as Qwen
  • Architecture
    • MoE (for the Mixture-of-Experts variant; the other models are dense)
    • GQA
    • Dual Chunk Attention with YaRN (for long context)
    • SwiGLU
    • RoPE
    • QKV bias
    • RMSNorm
    • pre-normalization
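
Several of the architecture components listed above are small enough to sketch directly. The following is a minimal pure-Python illustration, not the Qwen2 implementation: all function names are hypothetical, the `eps` default is an assumption, and the RoPE variant shown rotates adjacent pairs (other implementations pair dimensions differently). MoE routing, QKV bias, and Dual Chunk Attention are not shown.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: divide x by its root mean square, then apply a learned per-dimension scale."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]

def silu(z):
    """SiLU (swish) activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def matvec(matrix, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: w_down @ (silu(w_gate @ x) * (w_up @ x))."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    return matvec(w_down, [g * u for g, u in zip(gate, up)])

def pre_norm_ffn(x, norm_weight, w_gate, w_up, w_down):
    """Pre-normalization: normalize the sublayer input, then add the output to the residual."""
    y = swiglu_ffn(rms_norm(x, norm_weight), w_gate, w_up, w_down)
    return [a + b for a, b in zip(x, y)]

def apply_rope(x, position, base=10000.0):
    """RoPE: rotate each pair (x[i], x[i+1]) by the angle position * base**(-i/d)."""
    d = len(x)  # assumed to be even
    out = list(x)
    for i in range(0, d, 2):
        angle = position * base ** (-i / d)
        c, s = math.cos(angle), math.sin(angle)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out

def kv_head_for_query_head(q_head, num_q_heads, num_kv_heads):
    """GQA: each group of consecutive query heads shares one key/value head."""
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size
```

With 8 query heads and 2 key/value heads, for example, query heads 0 to 3 read KV head 0 and query heads 4 to 7 read KV head 1, which is where GQA saves KV-cache memory. In `apply_rope`, passing `base=1_000_000.0` instead of `10_000.0` makes every pair after the first rotate more slowly per position.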

  • Pretraining data: 7T tokens (30 languages)
  • Long Context Training
    • context window: 4,096 tokens -> 32,768 tokens
    • base frequency of RoPE: 10,000 -> 1,000,000
  • Post-training
    • SFT: 500,000 examples (instruction following, coding, mathematics, logical reasoning, role-playing, multilingualism, and safety)
    • DPO: an offline preference dataset, plus online preference pairs generated with a reward model (for the online stage, they reportedly used something called an Online Merging Optimizer)
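
For the DPO step above, the per-pair objective can be sketched as follows. This is a minimal illustration of the standard DPO loss, not the report's training code: the function and argument names are hypothetical, `beta=0.1` is an assumed value, and each log-probability is assumed to be the sum of token log-probabilities of a full response. Online sampling, reward-model scoring, and the Online Merging Optimizer are not shown.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair.

    loss = -log sigmoid(beta * ((policy - ref) on chosen - (policy - ref) on rejected))
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) equals log(1 + exp(-m)); this form avoids overflow for large |m|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy equals the reference model the margin is zero and the loss is log 2; the loss falls as the policy raises the chosen response's likelihood relative to the rejected one, measured against the reference.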

Experimental results

  • Higher benchmark performance than Llama-3-70B