(Paper Summary) π0.5: a Vision-Language-Action Model with Open-World Generalization

Key Points

  • Vision-Language-Action model
    • First generates a high-level subtask, then generates the actions for it (actions are discretized with the FAST action tokenizer); a minimal inference sketch follows this list
    • Built on a pre-trained SigLIP (400M) vision encoder + Gemma (2.6B) language backbone
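
The subtask-then-action decoding above can be summarized in a small sketch. This is a hypothetical illustration, not the authors' code: the `Pi05Policy` stub, its method names, and the dummy outputs are all assumptions; only the two-stage ordering comes from the paper.

```python
import numpy as np

class Pi05Policy:
    """Stub standing in for the real VLM + action decoder (names are assumptions)."""

    def generate_subtask(self, image: np.ndarray, task_prompt: str) -> str:
        # Stage 1: the SigLIP+Gemma backbone autoregressively decodes a
        # high-level subtask as text (dummy output here).
        return "pick up the plate"

    def generate_actions(self, image: np.ndarray, subtask: str) -> np.ndarray:
        # Stage 2: conditioned on the predicted subtask, decode low-level
        # actions -- FAST tokens during pre-training, continuous chunks from
        # the flow-matching action expert after post-training (dummy output).
        return np.zeros((50, 7))  # e.g. a 50-step chunk of 7-DoF actions

def act(policy: Pi05Policy, image: np.ndarray, task_prompt: str):
    subtask = policy.generate_subtask(image, task_prompt)
    actions = policy.generate_actions(image, subtask)
    return subtask, actions

if __name__ == "__main__":
    img = np.zeros((224, 224, 3), dtype=np.uint8)
    print(act(Pi05Policy(), img, "clean the kitchen")[0])
```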

  • Data (a mixture-sampling sketch follows this list)
    • pretrain
      • Diverse Mobile Manipulator data: ~400 hours of mobile manipulators performing household tasks in about 100 different home environments
      • Diverse Multi-Environment non-mobile robot data
      • Cross-Embodiment laboratory data
      • High-level subtask prediction data
      • Multi-modal Web Data (public data)
    • post-train
      • Diverse Mobile Manipulator data
      • Diverse Multi-Environment non-mobile robot data
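
As a rough illustration of how such a pre-training mixture might be consumed, here is a minimal weighted-sampling sketch. The dataset keys mirror the list above, but the weights are invented for illustration; the paper's actual mixture ratios are not reproduced here.

```python
import random

# Hypothetical pre-training mixture; the weights below are assumptions,
# not the paper's actual sampling ratios.
PRETRAIN_MIXTURE = {
    "mobile_manipulator_homes": 0.30,      # ~400 h, ~100 home environments
    "multi_env_static_robots": 0.25,
    "cross_embodiment_lab": 0.20,
    "high_level_subtask_prediction": 0.15,
    "multimodal_web_data": 0.10,
}

def sample_source(mixture: dict[str, float]) -> str:
    """Draw one dataset name with probability proportional to its weight."""
    names, weights = zip(*mixture.items())
    return random.choices(names, weights=weights, k=1)[0]

if __name__ == "__main__":
    counts = {name: 0 for name in PRETRAIN_MIXTURE}
    for _ in range(10_000):
        counts[sample_source(PRETRAIN_MIXTURE)] += 1
    print(counts)  # empirical frequencies should track the weights
```
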
  • Training
    • During both pre-training and post-training, tokens (text, plus FAST-tokenized actions) are supervised with a cross-entropy loss
    • During post-training, continuous actions are additionally supervised with a flow-matching loss (see the sketch after this list)
    • Pre-training runs for 280k steps, post-training for 80k steps
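
To make the flow-matching objective concrete, here is a minimal PyTorch sketch under a standard rectified-flow convention (interpolant x_τ = τ·a + (1−τ)·ε, target velocity a − ε). The time convention, `model` signature, and tensor shapes are assumptions; the paper's exact parameterization may differ.

```python
import torch

def flow_matching_loss(model, obs, actions):
    """Minimal flow-matching action loss (sketch; conventions are assumptions).

    actions: (batch, horizon, action_dim) ground-truth action chunk
    model:   predicts a velocity field v_theta(x_tau, tau, obs)
    """
    eps = torch.randn_like(actions)            # Gaussian noise endpoint
    tau = torch.rand(actions.shape[0], 1, 1)   # per-sample time in [0, 1]
    x_tau = tau * actions + (1.0 - tau) * eps  # point on the straight path
    target_v = actions - eps                   # velocity of that path
    pred_v = model(x_tau, tau, obs)            # predicted velocity
    return ((pred_v - target_v) ** 2).mean()   # MSE flow-matching objective

if __name__ == "__main__":
    dummy = lambda x, tau, obs: torch.zeros_like(x)  # placeholder network
    a = torch.randn(8, 50, 7)                        # batch of action chunks
    print(flow_matching_loss(dummy, obs=None, actions=a).item())
```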