(Paper Summary) OpenVLA: An Open-Source Vision-Language-Action Model (Paper)

Key Points

  • vision-language-action (VLA) model
  • 970k real-world robot demonstrations (Open X-Embodiment dataset)
  • learns new tasks via Parameter-Efficient Fine-Tuning (see the LoRA sketch after this list)
  • acts in small increments, one short motion at a time (demo videos are sped up)
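
The PEFT bullet above can be made concrete with a small LoRA sketch using HuggingFace `transformers` and `peft`; the checkpoint name, rank, and target modules here are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal LoRA fine-tuning sketch (HuggingFace transformers + peft).
# Checkpoint name, rank, and target modules are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # assumed base; the summary notes a 7B Llama2 backbone
    torch_dtype=torch.bfloat16,
)

lora_cfg = LoraConfig(
    r=32,                                 # low-rank dimension (hypothetical value)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the small LoRA adapters are trainable
```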

  • architecture (see the fusion/projector sketch below this list)
    • 600M visual encoder (SigLIP and DinoV2, channel-wise concat)
    • 2-layer MLP projector
    • 7B Llama2
  • robot action
    • discretized action space: each action dimension is binned over the range between the 1st and 99th quantile of that variable in the training data (see the discretization sketch below this list)
  • data curation
    • Open X dataset: >70 individual robots, >2M robot trajectories
    • Open X-Embodiment dataset: sampled from Open X using the following criteria (a weighted-mixing sketch follows below this list)
      • coherent input and output space across all training data
      • balanced mix of embodiments, tasks, and scenes
  • Results
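
As referenced in the architecture bullet, here is a minimal PyTorch sketch of the channel-wise SigLIP/DinoV2 fusion followed by the 2-layer MLP projector into the LLM embedding space; the feature dimensions and GELU activation are assumptions for illustration, not the paper's exact values.

```python
import torch
import torch.nn as nn

class FusedVisualProjector(nn.Module):
    """Channel-wise concat of SigLIP and DinoV2 patch features, followed by
    a 2-layer MLP projector into the LLM embedding space. Dimensions and
    the GELU activation are illustrative assumptions."""

    def __init__(self, siglip_dim=1152, dino_dim=1024, llm_dim=4096):
        super().__init__()
        self.projector = nn.Sequential(
            nn.Linear(siglip_dim + dino_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, siglip_feats, dino_feats):
        # Both inputs: (batch, num_patches, dim); fuse along the channel axis.
        fused = torch.cat([siglip_feats, dino_feats], dim=-1)
        return self.projector(fused)  # (batch, num_patches, llm_dim)

# Usage with dummy patch features (batch=2, 256 patches per image):
proj = FusedVisualProjector()
visual_tokens = proj(torch.randn(2, 256, 1152), torch.randn(2, 256, 1024))
```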
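
The quantile-clipped action discretization from the robot-action bullet can be sketched as follows; the bin count of 256 and the helper names are assumptions, not confirmed by this summary.

```python
import numpy as np

def make_action_codec(train_actions, n_bins=256):
    """Per-dimension action discretizer. The range is clipped to the
    1st/99th quantile observed in training data; n_bins=256 is an
    assumption (bins are later mapped onto LLM token IDs)."""
    q01 = np.quantile(train_actions, 0.01, axis=0)
    q99 = np.quantile(train_actions, 0.99, axis=0)
    span = np.maximum(q99 - q01, 1e-8)  # guard against degenerate dimensions

    def discretize(action):
        norm = (np.clip(action, q01, q99) - q01) / span       # -> [0, 1]
        return np.minimum((norm * n_bins).astype(int), n_bins - 1)

    def undiscretize(bins):
        return (bins + 0.5) / n_bins * span + q01             # bin centers

    return discretize, undiscretize

# Example: 7-DoF actions drawn from dummy training data.
discretize, undiscretize = make_action_codec(np.random.randn(10_000, 7))
tokens = discretize(np.zeros(7))   # integer bins in [0, 255]
approx = undiscretize(tokens)      # continuous action reconstruction
```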
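
The "balanced mix" criterion in data curation suggests sampling trajectories by per-dataset mixture weights rather than raw dataset size; a minimal sketch, with hypothetical weights and dataset names:

```python
import random

# Hypothetical mixture weights: the curation re-weights Open X datasets
# toward a balanced mix of embodiments, tasks, and scenes, rather than
# sampling proportionally to raw dataset size.
MIXTURE_WEIGHTS = {"bridge": 0.4, "rt1": 0.3, "other": 0.3}

def sample_trajectory(datasets):
    """datasets: dict mapping dataset name -> list of trajectories."""
    names = list(MIXTURE_WEIGHTS)
    weights = [MIXTURE_WEIGHTS[n] for n in names]
    name = random.choices(names, weights=weights, k=1)[0]
    return random.choice(datasets[name])
```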