(Model summary) VL-JEPA: Joint Embedding Predictive Architecture for Vision-language (paper)
Key points
- An embedding model that takes vision + query text as input.
- Model architecture
- vision encoder: V-JEPA 2 ViT-L
- tokenizer and token embedding: Llama-3.2-1B
- predictor: last 8 Transformer layers of Llama-3.2-1B
- Trained with the answer text encoded by EmbeddingGemma-300M as the prediction target (a minimal sketch follows below).
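A minimal PyTorch-style sketch of how these components could fit together. The concatenation of visual and query tokens, the mean pooling, the projection layers, the default dimensions, and the cosine-distance loss are all assumptions for illustration; the summary above only fixes which pretrained components are used and what the target is.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VLJEPA(nn.Module):
    """Illustrative sketch of the VL-JEPA forward pass (not the reference implementation)."""

    def __init__(self, vision_encoder, text_embedding, predictor,
                 vision_dim=1024, predictor_dim=2048, target_dim=768):
        super().__init__()
        self.vision_encoder = vision_encoder  # V-JEPA 2 ViT-L
        self.text_embedding = text_embedding  # Llama-3.2-1B token embedding (tokenizer applied upstream)
        self.predictor = predictor            # last 8 Transformer layers of Llama-3.2-1B
        self.vision_proj = nn.Linear(vision_dim, predictor_dim)  # align ViT-L width to Llama width (assumption)
        self.out_proj = nn.Linear(predictor_dim, target_dim)     # map to EmbeddingGemma-300M width (assumption)

    def forward(self, video, query_ids):
        visual_tokens = self.vision_proj(self.vision_encoder(video))  # (B, Nv, predictor_dim)
        text_tokens = self.text_embedding(query_ids)                  # (B, Nt, predictor_dim)
        hidden = self.predictor(torch.cat([visual_tokens, text_tokens], dim=1))
        pred = self.out_proj(hidden.mean(dim=1))                      # mean pooling is an assumption
        return F.normalize(pred, dim=-1)

def jepa_loss(pred_emb, answer_emb):
    # answer_emb: answer text encoded by EmbeddingGemma-300M, treated as a fixed target.
    # A cosine-distance loss is assumed here; the paper defines the actual objective.
    target = F.normalize(answer_emb, dim=-1)
    return (1.0 - (pred_emb * target).sum(dim=-1)).mean()
```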
- Task
- vision+text → text
- VQA, classification (embed each answer option and pick the one with maximum cosine similarity)
- text → vision retrieval (encode the video, then find the clips that match the query; both uses are sketched below)
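Both inference modes reduce to nearest-neighbor search over embeddings produced as above. A small sketch, assuming the option/clip embeddings are precomputed and stacked into a single tensor:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def classify(pred_emb: torch.Tensor, option_embs: torch.Tensor) -> int:
    """Classification: embed each answer option and pick the one with maximum
    cosine similarity to the predicted embedding. pred_emb: (D,), option_embs: (N, D)."""
    sims = F.cosine_similarity(pred_emb.unsqueeze(0), option_embs, dim=-1)  # (N,)
    return int(sims.argmax())

@torch.no_grad()
def retrieve_clips(query_emb: torch.Tensor, clip_embs: torch.Tensor, top_k: int = 5) -> list[int]:
    """Text -> vision retrieval: rank pre-encoded video clips by cosine similarity
    to the query embedding and return the top-k clip indices."""
    sims = F.cosine_similarity(query_emb.unsqueeze(0), clip_embs, dim=-1)   # (num_clips,)
    return sims.topk(top_k).indices.tolist()
```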
- Train
- pretrain
- image-text data: PLM-Image-Auto, Datacomp, YFCC-100M
- video-text data: PLM-Video-Auto, Ego4D atomic action descriptions, and the internal dataset Action100M (consisting of captions generated on HowTo100M videos)
- Train on 1 frame per visual input for 100k iterations (batch size 24k)
- Then train on 16 frames per input (image-video pretraining); both stages are sketched below
- ~2 weeks on 24 nodes with 8×NVIDIA H200 GPUs
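The two-stage pretraining schedule, written out as an illustrative config; the field names are mine, the stage-2 iteration count is not given above, and the stage-2 batch size is assumed to match stage 1.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PretrainStage:
    frames_per_input: int
    iterations: Optional[int]  # None where the summary gives no number
    batch_size: int

# Stage 1: single-frame (image-style) pretraining.
stage1 = PretrainStage(frames_per_input=1, iterations=100_000, batch_size=24_000)

# Stage 2: image-video pretraining with 16 frames per input
# (iteration count not stated; batch size assumed unchanged).
stage2 = PretrainStage(frames_per_input=16, iterations=None, batch_size=24_000)
```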
- SFT
- PLM data mixture
- 25M VQA samples
- 2.8M captioning samples
- 1.8M classification samples
- downsampled pretraining stage data (to avoid catastrophic forgetting)
- 35k steps, batch size 6k
- ~2 days on 24 nodes (the mixture is summarized in the sketch below)
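The SFT mixture and schedule, restated as an illustrative config; only the sample counts, step count, and batch size come from the summary, and the downsampling rate for replayed pretraining data is not stated.

```python
from typing import Optional

# SFT data mixture (sample counts from the summary; structure is illustrative).
sft_mixture: dict[str, Optional[int]] = {
    "vqa": 25_000_000,
    "captioning": 2_800_000,
    "classification": 1_800_000,
    # Pretraining-stage data is replayed at a downsampled rate to avoid
    # catastrophic forgetting; the exact amount is not given.
    "pretrain_replay": None,
}

sft_schedule = {"steps": 35_000, "batch_size": 6_000}
```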