(논문 요약) PaliGemma 2: A Family of Versatile VLMs for Transfer (Paper)

핵심 내용

  • PaliGemma2 = ViT + Gemma2

  • Input resolutions
    • 224x224
    • 448x448
    • 896x896
  • Tasks
    • OCR
    • table structure recognition
    • molecular structure recognition
    • music score recognition
    • long fine-grained captioning
    • radiography report generation
  • Location: 각 token 의 index 사용

  • 학습
    • Stage 1: 1B multimodal task data, 전체 parameter finetune, image resolution 224x224
    • Stage 2-1: 50M (1B 데이터의 샘플), image resolution 448x448
    • Stage 2-2: 10M (1B 데이터의 샘플), image resolution 896x896
    • Stage 3: stage 1 혹은 2 의 모델을 task 에 맞게 finetune (document-related tasks, long caption generation, medical image understanding)
    • Learning rate
      • 3B: $10^{-5}$
      • 10B, 28B: $5\times 10^{-6}$

성능

  • Resolution, model size 변화에 따른 성능 차이