(Paper Summary) Chameleon: Mixed-Modal Early-Fusion Foundation Models

Key Points

  • Pretrained on roughly 10 trillion tokens in total.

Tokenizer

  • Image tokenization: a 512 × 512 image is encoded into 1024 discrete tokens drawn from a codebook of size 8,192.
  • Text tokenizer: a BPE tokenizer trained by the authors with a vocabulary size of 65,536, which includes the 8,192 image codebook tokens.
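
As a rough illustration of what a shared-vocabulary, early-fusion input looks like, the sketch below interleaves text BPE ids and shifted image codebook ids into one token sequence. The tokenizer objects (`text_bpe`, `image_vq`), the function name, and the exact vocabulary layout are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of early-fusion tokenization. The split of the 65,536-entry
# vocabulary (text ids first, image codebook ids appended) is an assumption.
TEXT_VOCAB_SIZE = 65_536 - 8_192      # assumed number of text/BPE entries
IMAGE_CODEBOOK_SIZE = 8_192           # image codebook size per the paper
IMAGE_TOKENS_PER_IMAGE = 1_024        # one 512x512 image -> 1024 tokens

def encode_mixed(segments, text_bpe, image_vq):
    """Encode interleaved (kind, payload) segments into one token sequence.

    segments: list of ("text", str) or ("image", image_array) pairs.
    Returns a flat list of ids drawn from the shared 65,536-entry vocabulary.
    """
    ids = []
    for kind, payload in segments:
        if kind == "text":
            ids.extend(text_bpe.encode(payload))             # BPE ids
        elif kind == "image":
            codes = image_vq.encode(payload)                 # 1024 codebook indices in [0, 8192)
            assert len(codes) == IMAGE_TOKENS_PER_IMAGE
            ids.extend(TEXT_VOCAB_SIZE + c for c in codes)   # shift into the shared vocab
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return ids
```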

Pretraining

  • Stage 1 (80% of training): 2.9 trillion text-only tokens + 1.5 trillion text-image tokens (512 × 512 images) + 400 billion tokens of interleaved text-and-image data.
  • Stage 2 (20% of training): the stage-1 data down-weighted to 50%, mixed with higher-quality datasets (a filtered subset of instruction-tuning sets); see the schedule sketch below.
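
A hedged sketch of the two-stage schedule written out as a config; the structure and field names are invented for illustration, and only the token counts and proportions come from the summary above.

```python
# Illustrative config for the two-stage pretraining schedule described above.
# Key names and structure are made up; numbers come from the summary.
PRETRAINING_STAGES = [
    {
        "name": "stage1",
        "fraction_of_training": 0.80,
        "data": {
            "text_only_tokens": 2.9e12,
            "text_image_tokens": 1.5e12,              # 512x512 images
            "interleaved_text_image_tokens": 0.4e12,
        },
    },
    {
        "name": "stage2",
        "fraction_of_training": 0.20,
        "data": {
            "stage1_mixture_weight": 0.50,            # stage-1 data down-weighted to 50%
            "higher_quality_sets": "filtered instruction-tuning subsets",
        },
    },
]
```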

Numerical Stability

  • Norm re-ordering in Chameleon-34B: the norm is applied to each sublayer's output inside the residual branch, i.e. h = x + attention_norm(attention(x)) and output = h + ffn_norm(feed_forward(h)), rather than to the sublayer input (see the sketch after this list).
  • Architecture: largely follows LLaMA-2 (RMSNorm, SwiGLU, RoPE), with query-key normalization (QK-Norm) added to control the growth of attention logits.
  • z-loss definition: add 10^-5 * (log Z)^2 to the training loss, where Z = Σ_i e^{x_i} is the partition function of the softmax (see the sketch after this list).
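
A minimal PyTorch sketch, under assumed module names and shapes, of the two stability tricks above: QK-Norm and the Chameleon-34B norm re-ordering (LayerNorm stands in for the RMSNorm of the actual architecture).

```python
import torch
import torch.nn as nn

class QKNorm(nn.Module):
    """Query-key normalization: normalize queries and keys per head before
    the attention dot product, bounding the growth of attention logits."""
    def __init__(self, head_dim: int):
        super().__init__()
        self.q_norm = nn.LayerNorm(head_dim)
        self.k_norm = nn.LayerNorm(head_dim)

    def forward(self, q: torch.Tensor, k: torch.Tensor):
        return self.q_norm(q), self.k_norm(k)

class ReorderedBlock(nn.Module):
    """Chameleon-34B-style block: norms on the sublayer outputs, inside the
    residual branch, i.e.
        h   = x + attention_norm(attention(x))
        out = h + ffn_norm(feed_forward(h))
    instead of the usual pre-norm h = x + attention(attention_norm(x))."""
    def __init__(self, dim: int, attention: nn.Module, feed_forward: nn.Module):
        super().__init__()
        self.attention = attention        # any self-attention module: (B, T, dim) -> (B, T, dim)
        self.feed_forward = feed_forward  # any MLP (SwiGLU in the paper): (B, T, dim) -> (B, T, dim)
        self.attention_norm = nn.LayerNorm(dim)
        self.ffn_norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = x + self.attention_norm(self.attention(x))    # norm applied after attention
        return h + self.ffn_norm(self.feed_forward(h))    # norm applied after FFN
```

And a small sketch of the z-loss term on top of standard cross-entropy; the function name is hypothetical, and the 1e-5 coefficient is the value quoted above.

```python
import torch
import torch.nn.functional as F

def loss_with_z_reg(logits: torch.Tensor, targets: torch.Tensor,
                    z_coef: float = 1e-5) -> torch.Tensor:
    """Cross-entropy plus z-loss: penalize (log Z)^2, where Z = sum_i exp(x_i)
    is the softmax partition function, to keep final logits from drifting."""
    ce = F.cross_entropy(logits, targets)        # logits: (N, vocab), targets: (N,)
    log_z = torch.logsumexp(logits, dim=-1)      # log Z per position
    return ce + z_coef * (log_z ** 2).mean()
```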
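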

Required Resources

Experimental Results