(Paper Summary) Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers
Key Ideas
Heterogeneous Pre-trained Transformers (HPT): pretrain a large, shareable trunk of a policy neural network to learn a task- and embodiment-agnostic shared representation
Architecture
- input: proprioception + camera observations
- output: action tokens
- Each robot arm's inputs are handled by an embodiment-specific stem, and its outputs by an embodiment-specific head
- The transformer trunk in the middle is shared across all embodiments
- In the stem, proprioception is featurized with an MLP and vision with an encoder; learnable queries then cross-attend to these features (see the sketch after this list)
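The stem → trunk → head layout above maps naturally to a few modules. Below is a minimal PyTorch sketch of that flow, not the paper's implementation: the embedding width, query count, layer sizes, and the linear projection standing in for a pretrained vision encoder are all illustrative assumptions.

```python
# Minimal sketch of the HPT stem/trunk/head layout. All hyperparameters
# (D, N_QUERIES, depths) are illustrative assumptions, not the paper's.
import torch
import torch.nn as nn

D = 256          # shared trunk embedding width (assumed)
N_QUERIES = 16   # learnable query tokens per modality (assumed)

class CrossAttnStem(nn.Module):
    """Embodiment-specific stem: projects modality features, then lets a
    fixed set of learnable query tokens cross-attend to them, so every
    embodiment emits the same number of tokens to the shared trunk."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.proj = nn.Linear(feat_dim, D)
        self.queries = nn.Parameter(torch.randn(1, N_QUERIES, D) * 0.02)
        self.attn = nn.MultiheadAttention(D, num_heads=8, batch_first=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        kv = self.proj(feats)                           # (B, T, D)
        q = self.queries.expand(feats.size(0), -1, -1)  # (B, N_QUERIES, D)
        out, _ = self.attn(q, kv, kv)
        return out                                      # (B, N_QUERIES, D)

class HPTPolicy(nn.Module):
    def __init__(self, proprio_dim: int, vision_feat_dim: int, action_dim: int):
        super().__init__()
        # Per-embodiment stems (swapped out for each new robot).
        self.proprio_mlp = nn.Sequential(
            nn.Linear(proprio_dim, D), nn.GELU(), nn.Linear(D, D))
        self.proprio_stem = CrossAttnStem(D)
        self.vision_stem = CrossAttnStem(vision_feat_dim)
        # Shared trunk: the large pretrained transformer reused across embodiments.
        layer = nn.TransformerEncoderLayer(
            D, nhead=8, dim_feedforward=4 * D, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=6)
        # Per-task head mapping trunk tokens to actions.
        self.head = nn.Linear(D, action_dim)

    def forward(self, proprio: torch.Tensor, vision_feats: torch.Tensor) -> torch.Tensor:
        p = self.proprio_stem(self.proprio_mlp(proprio))  # (B, N_QUERIES, D)
        v = self.vision_stem(vision_feats)                # (B, N_QUERIES, D)
        tokens = self.trunk(torch.cat([p, v], dim=1))     # (B, 2*N_QUERIES, D)
        return self.head(tokens.mean(dim=1))              # (B, action_dim)

# Usage with made-up shapes: 7-DoF proprio, 49 vision patch features of dim 512.
policy = HPTPolicy(proprio_dim=7, vision_feat_dim=512, action_dim=7)
actions = policy(torch.randn(2, 1, 7), torch.randn(2, 49, 512))
print(actions.shape)  # torch.Size([2, 7])
```

The design point to notice is that the learnable query tokens fix the number of tokens each stem emits, so the shared trunk sees the same interface regardless of how many cameras or proprioceptive dimensions a given embodiment has.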
Experimental Results
- Pretraining is effective: the pretrained shared trunk improves downstream policy performance compared to training from scratch.