(Paper Summary) Chameleon: Mixed-Modal Early-Fusion Foundation Models
Key Points
- Pretrained on roughly 10T tokens in total.
Tokenizer
- Image tokenization: encodes a 512 × 512 image into 1024 discrete tokens drawn from a codebook of size 8,192.
- Text tokenizer: a BPE tokenizer trained by the authors with a vocabulary size of 65,536, which includes the 8,192 image codebook tokens (see the sketch below for how the two token spaces can be combined).
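The numbers above fix the image-token arithmetic: a 512 × 512 image with 16× spatial downsampling gives a 32 × 32 grid, i.e. 1024 discrete tokens. The sketch below shows one plausible way the two token spaces could be merged into a single early-fusion sequence; the vocabulary layout (image IDs placed after the text IDs) and the helper names are assumptions, not the paper's actual code.

```python
# Minimal sketch of assembling a mixed-modal token stream.
# The ID offset/layout and helper names are assumptions, not the paper's code.
import numpy as np

TEXT_VOCAB = 65_536 - 8_192      # BPE text tokens
IMAGE_CODEBOOK = 8_192           # VQ image codebook size
IMAGE_TOKENS_PER_IMAGE = 32 * 32 # 512x512 image, 16x downsampling -> 1024 tokens

def image_to_tokens(codes: np.ndarray) -> np.ndarray:
    """Map VQ codebook indices (0..8191) into the shared vocabulary."""
    assert codes.size == IMAGE_TOKENS_PER_IMAGE
    return codes.flatten() + TEXT_VOCAB  # image IDs placed after the text IDs

def interleave(text_ids: list[int], image_codes: np.ndarray) -> np.ndarray:
    """Concatenate text BPE IDs and image tokens into one early-fusion sequence."""
    return np.concatenate([np.asarray(text_ids), image_to_tokens(image_codes)])

# Example: a short caption followed by one tokenized 512x512 image
seq = interleave([12, 345, 6789], np.random.randint(0, IMAGE_CODEBOOK, (32, 32)))
print(seq.shape)  # (1027,) = 3 text tokens + 1024 image tokens
```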
Pretraining
- Stage 1 (80% of training): 2.9 trillion text-only tokens + 1.5 trillion text-image tokens (512 × 512 images) + 400 billion tokens of interleaved text-and-image data.
- Stage 2 (20% of training): the stage-1 data down-weighted to 50%, mixed with higher-quality datasets (including a filtered subset of instruction-tuning sets). See the mixture sketch below.
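One way to picture the two-stage schedule is as a weighted mixture over data sources. The sketch below is purely illustrative: only the stage-1 token counts come from the summary above; the source names and the sample-by-normalized-weight scheme are assumptions.

```python
import random

# Hypothetical data-mixture config; only the stage-1 token counts come from the
# summary above, the names and the sampling scheme are illustrative assumptions.
STAGE1_MIX = {
    "text_only":        2.9e12,  # tokens
    "text_image_pairs": 1.5e12,  # 512x512 images, tokenized
    "interleaved":      0.4e12,
}

def sampling_weights(mix: dict[str, float]) -> dict[str, float]:
    """Convert raw token counts into normalized sampling probabilities."""
    total = sum(mix.values())
    return {name: count / total for name, count in mix.items()}

def sample_source(weights: dict[str, float]) -> str:
    """Pick the source of the next training sequence according to the mixture."""
    names, probs = zip(*weights.items())
    return random.choices(names, weights=probs, k=1)[0]

print(sampling_weights(STAGE1_MIX))
print(sample_source(sampling_weights(STAGE1_MIX)))
```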
Numerical Stability
- Re-ordering of the norms in Chameleon-34B: the normalization is applied to the output of each attention/FFN sub-layer inside the residual branch (Swin-Transformer style); see the sketch after this list.
- Architecture: otherwise follows the LLaMA-2 recipe (RMSNorm, SwiGLU, RoPE), with query-key normalization (QK-Norm) added for stability.
- z-loss: regularizes the softmax partition function Z by adding 1e-5 · log²(Z) to the training loss; a minimal sketch follows the list.
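The norm re-ordering can be made concrete with a simplified block. This is a minimal sketch using standard PyTorch modules (LayerNorm, MultiheadAttention, a SiLU MLP) rather than Chameleon's actual RMSNorm/SwiGLU implementation; only the placement of the norms inside the residual branches reflects the paper.

```python
import torch
import torch.nn as nn

class ReorderedBlock(nn.Module):
    """Transformer block with Swin-style norm re-ordering:
    h = x + norm(attn(x)); out = h + norm(ffn(h)).
    Module choices are a simplified stand-in for the actual Chameleon block."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        h = x + self.norm1(attn_out)          # norm applied to the sub-layer output
        return h + self.norm2(self.ffn(h))    # same re-ordering for the FFN branch
```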
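The z-loss itself is a one-line addition to the training objective. A minimal sketch assuming a standard cross-entropy setup; the 1e-5 coefficient is the one reported for the paper, while the function name and signature are made up for illustration.

```python
import torch
import torch.nn.functional as F

def loss_with_z_regularization(logits: torch.Tensor, targets: torch.Tensor,
                               z_coeff: float = 1e-5) -> torch.Tensor:
    """Cross-entropy plus z-loss: penalize log^2 of the softmax partition function Z."""
    ce = F.cross_entropy(logits, targets)
    log_z = torch.logsumexp(logits, dim=-1)   # log Z for each position
    z_loss = z_coeff * (log_z ** 2).mean()    # keeps the logit scale from drifting
    return ce + z_loss
```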
Required Resources
Experimental Results