(논문 요약) Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation (Paper)
핵심 내용
- understanding encoder 로 SigLIP, generation encoder 로 VQ tokenizer 사용.
학습
- autoregressive loss
- Unified Pretraining
- Data: pure text data, multimodal understanding data, visual generation data
- 먼저, ImageNet-1k 로 simple visual generation training (inspired by Pixart paper)
- 그리고 나서, general text-to-image data 로 학습.
- Supervised Fine-tuning
- Data: pure text dialogue data, multimodal understanding data, visual generation data