(Paper summary) SAM 3D: 3Dfy Anything in Images (Paper)

Key points

  • Architecture
    • shape tokens: 4096 tokens with $d=1024$, representing a grid with resolution $64^3$.
    • layout tokens: one token each for rotation $R$, translation $t$, and scale $s$ (each with $d=1024$).
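
The token counts above can be sanity-checked with a few lines of arithmetic. This is a minimal sketch, not the paper's code: it assumes the 4096 shape tokens sit on a regular cubic latent grid and that shape and layout tokens are concatenated into one sequence, neither of which is stated in this summary. All variable names are illustrative.

```python
# Numbers quoted in the summary.
GRID_RES = 64             # voxel grid resolution per axis
NUM_SHAPE_TOKENS = 4096   # shape tokens
NUM_LAYOUT_TOKENS = 3     # one each for R, t, s
TOKEN_DIM = 1024          # d

# Assumption (not stated in the summary): shape tokens lie on a cubic latent grid.
latent_res = round(NUM_SHAPE_TOKENS ** (1 / 3))
assert latent_res ** 3 == NUM_SHAPE_TOKENS

# Voxels covered by one token if the 64^3 grid is split evenly.
voxels_per_axis_per_token = GRID_RES // latent_res
voxels_per_token = voxels_per_axis_per_token ** 3

# Assumption: shape and layout tokens are concatenated into one sequence.
seq_len = NUM_SHAPE_TOKENS + NUM_LAYOUT_TOKENS
num_values = seq_len * TOKEN_DIM

print(f"latent grid: {latent_res}^3, voxels per token: {voxels_per_token}")
print(f"sequence length: {seq_len}, values per sample: {num_values:,}")
```

Under these assumptions each shape token accounts for a 4x4x4 patch of the $64^3$ grid, and the three layout tokens add a negligible amount to the sequence length.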

  • Data
    • ISO-3DO (Isolated 3D Objects): single objects rendered from diverse viewpoints.
      • Built from meshes (e.g., Objaverse-XL and licensed assets), producing image–shape–texture triplets.
      • Purpose: teach strong priors for geometry/texture in a clean, object-centric setting.
    • RP-3DO (Render–Paste 3D Objects): synthetic 3D objects rendered and pasted into real images.
      • Flying Occlusions (FO): randomly oriented synthetic objects are inserted to create occlusion robustness and mask-following behavior.
      • Object Swap – Random (OS-R): an object in a real image is replaced by a random synthetic mesh, with translation/scale set from the mask and a depth/pointmap; rotation is random. Trains layout estimation and occlusion handling with realistic context.
      • Object Swap – Annotated (OS-A): like OS-R but uses human-annotated pose (translation/rotation/scale) and mesh choices, yielding pixel-aligned supervision useful for texture refinement.
    • MITL-3DO and Art-3DO (human-in-the-loop real data): built from real images with a human- and model-in-the-loop pipeline.
      • MITL-3DO: larger but noisier annotations from non-experts, used for supervised fine-tuning (SFT) and as preference data (for DPO). Provides shapes, poses, and masks; texture uses a separate collection that emphasizes high-aesthetics examples.
      • Art-3DO: smaller, high-quality meshes and alignments made by professional 3D artists. Used to raise geometric quality, symmetry, closure, and aesthetics, and to supply strong signals for preference alignment.
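
The Object Swap – Random (OS-R) placement described above (translation and scale from the mask and pointmap, rotation random) could be sketched roughly as follows. This summary does not say how the paper derives translation or scale, so the choices here (per-axis median of masked points for translation, largest bounding-box extent for scale) are assumptions, and `placement_from_mask` and `random_unit_quaternion` are hypothetical helper names.

```python
import math
import random
from statistics import median

def random_unit_quaternion(rng):
    """Sample a uniformly random rotation as a unit quaternion (Shoemake's method)."""
    u1, u2, u3 = rng.random(), rng.random(), rng.random()
    a, b = math.sqrt(1.0 - u1), math.sqrt(u1)
    return (
        a * math.sin(2 * math.pi * u2),
        a * math.cos(2 * math.pi * u2),
        b * math.sin(2 * math.pi * u3),
        b * math.cos(2 * math.pi * u3),
    )

def placement_from_mask(mask, pointmap, rng):
    """Derive (translation, scale, rotation) for a pasted mesh.

    mask:     H x W nested list of 0/1 flags for the object being replaced.
    pointmap: H x W nested list of (x, y, z) points, one per pixel.
    """
    pts = [
        pointmap[i][j]
        for i, row in enumerate(mask)
        for j, flag in enumerate(row)
        if flag
    ]
    if not pts:
        raise ValueError("mask selects no pixels")
    # Assumption: translation is the per-axis median of the masked 3D points.
    translation = tuple(median(p[k] for p in pts) for k in range(3))
    # Assumption: scale is the largest axis-aligned extent of the masked points.
    scale = max(max(p[k] for p in pts) - min(p[k] for p in pts) for k in range(3))
    # Rotation is random, as in OS-R.
    rotation = random_unit_quaternion(rng)
    return translation, scale, rotation

# Toy 2x2 example with made-up values.
mask = [[1, 1], [0, 1]]
pointmap = [
    [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)],
    [(0.0, 1.0, 5.0), (1.0, 1.0, 2.5)],
]
translation, scale, rotation = placement_from_mask(mask, pointmap, random.Random(0))
print(translation, scale, rotation)
```

OS-A would replace the derived and random values with the human-annotated translation, rotation, and scale.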
  • Training
    1. pretrain on ISO-3DO.
    2. mid-train on RP-3DO (FO, OS-R + OS-A for texture) to gain robustness and layout skills in natural scenes.
    3. SFT on MITL-3DO then Art-3DO.
    4. preference optimization (DPO) using human choices.
    5. brief distillation for fast inference.
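
For step 4, the sketch below shows the standard DPO objective for a single preference pair, as formulated for language models. It is illustrative only: the summary does not specify which DPO variant the paper applies to its generative model, the log-likelihoods are plain floats here, and `beta=0.1` is an arbitrary placeholder value.

```python
import math

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one preference pair; inputs are log-likelihoods (floats).

    logp_*     : log-likelihood under the model being trained.
    ref_logp_* : log-likelihood under the frozen reference model.
    win / lose : the human-preferred and the rejected output.
    """
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # -log(sigmoid(margin)), split by sign so exp() never overflows.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical numbers: first the model equals the reference,
# then it shifts probability toward the human-preferred output.
baseline = dpo_loss(-1.5, -1.5, -1.5, -1.5)
improved = dpo_loss(-1.0, -2.0, -1.5, -1.5)
print(baseline, improved)
```

The loss falls as the model raises the likelihood of the preferred output relative to the reference and lowers that of the rejected one, which is the signal the human choices provide.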