(Paper summary) xGen-MM (BLIP-3): A Family of Open Large Multimodal Models (Paper)

Key points

  • Architecture: ViT vision encoder (+ perceiver resampler) and the phi3-mini LLM
  • Data
    • DPO data: VLFeedback (responses scored by GPT-4V)
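
To illustrate how scored preference data such as VLFeedback might feed preference tuning, here is a minimal sketch, assuming the standard DPO (Direct Preference Optimization) objective. The helper names `build_preference_pairs` and `dpo_loss`, the all-pairs construction, and `beta=0.1` are hypothetical choices made for illustration and are not taken from the paper.

```python
import math


def build_preference_pairs(scored):
    """Turn (response, score) tuples into (chosen, rejected) pairs.

    Assumption: every pair of responses with different scores yields one
    preference pair. The paper may select pairs differently.
    """
    pairs = []
    for i, (resp_a, score_a) in enumerate(scored):
        for resp_b, score_b in scored[i + 1:]:
            if score_a > score_b:
                pairs.append((resp_a, resp_b))
            elif score_b > score_a:
                pairs.append((resp_b, resp_a))
            # Ties carry no preference signal and are skipped.
    return pairs


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one pair: -log(sigmoid(margin)).

    Inputs are sequence log-probabilities under the policy being trained
    and under a frozen reference model.
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Toy data: three candidate answers with made-up scores.
pairs = build_preference_pairs([("a", 4.0), ("b", 2.0), ("c", 4.0)])
print(pairs)
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

The loss drops below log 2 once the policy assigns relatively more probability to the chosen response than the reference model does, which is the behavior preference tuning pushes toward.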

Experimental results

  • Few-shot performance falls short of Idefics2-8B