(Paper Summary) SAM 3D: 3Dfy Anything in Images (Paper)
Key points
- Architecture
  - Shape tokens: 4096 tokens with $d=1024$, representing a voxel grid of resolution $64^3$.
  - Layout tokens: one token each for rotation $R$, translation $t$, and scale $s$ ($d=1024$).
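
The token counts above imply some simple arithmetic, sketched below. The notes only give the counts and the width; the even tiling of the grid into cubic patches and the concatenation of shape and layout tokens into one sequence are assumptions made for illustration, not details confirmed by the notes.

```python
import numpy as np

D_MODEL = 1024            # token width, as stated in the notes
NUM_SHAPE_TOKENS = 4096   # as stated in the notes
GRID_RES = 64             # voxel grid resolution per axis, as stated

# Assumption: the shape tokens tile the grid evenly as cubic patches.
tokens_per_axis = round(NUM_SHAPE_TOKENS ** (1 / 3))
assert tokens_per_axis ** 3 == NUM_SHAPE_TOKENS
patch_size = GRID_RES // tokens_per_axis   # voxels per token along one axis
voxels_per_token = patch_size ** 3

# Placeholder tensors with the stated shapes.
shape_tokens = np.zeros((NUM_SHAPE_TOKENS, D_MODEL), dtype=np.float32)
layout_tokens = np.zeros((3, D_MODEL), dtype=np.float32)  # one each for R, t, s

# Assumption: both token groups share a single sequence.
sequence = np.concatenate([shape_tokens, layout_tokens], axis=0)
print(tokens_per_axis, patch_size, voxels_per_token, sequence.shape)
```

Under the even-tiling assumption, each shape token would cover a $4^3$ block of the $64^3$ grid, and the three layout tokens add almost nothing to the sequence length.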

- Data
  - ISO-3DO (Isolated 3D Objects): single objects rendered from diverse viewpoints.
    - Built from meshes (e.g., Objaverse-XL and licensed assets), producing image–shape–texture triplets.
    - Purpose: teach strong priors for geometry and texture in a clean, object-centric setting.
  - RP-3DO (Render–Paste 3D Objects): synthetic 3D objects rendered and pasted into real images.
    - Flying Occlusions (FO): randomly oriented synthetic objects are inserted to build occlusion robustness and mask-following behavior.
    - Object Swap – Random (OS-R): an object in a real image is replaced by a random synthetic mesh. Translation and scale are set from the mask and a depth/pointmap; rotation is random. This trains layout estimation and occlusion handling with realistic context.
    - Object Swap – Annotated (OS-A): like OS-R, but uses human-annotated pose (translation/rotation/scale) and mesh choices, yielding pixel-aligned supervision that is useful for texture refinement.
  - MITL-3DO and Art-3DO (human-in-the-loop real data): produced from real images through a human- and model-in-the-loop pipeline.
    - MITL-3DO: larger but noisier annotations from non-experts, used for supervised fine-tuning (SFT) and as preference data for DPO. Provides shapes, poses, and masks; for texture, a separate collection emphasizes higher-aesthetics examples.
    - Art-3DO: a smaller set of high-quality meshes and alignments made by professional 3D artists. Used to raise geometric quality, symmetry, closure, and aesthetics, and to supply strong signals for preference alignment.
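
The OS-R recipe (translation and scale from the mask and pointmap, rotation random) can be sketched roughly as follows. The notes do not say how translation and scale are derived, so the median center and the bounding-box extent used here are illustrative assumptions, and `os_r_layout` and `random_rotation` are hypothetical helper names.

```python
import numpy as np

def random_rotation(rng):
    """Sample a uniformly random 3D rotation via a normalized Gaussian quaternion."""
    q = rng.normal(size=4)
    q /= np.linalg.norm(q)
    w, x, y, z = q
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ])

def os_r_layout(pointmap, mask, rng):
    """pointmap: (H, W, 3) camera-space XYZ per pixel; mask: (H, W) bool."""
    pts = pointmap[mask]                        # 3D points of the object to replace
    t = np.median(pts, axis=0)                  # assumption: robust center as translation
    extent = pts.max(axis=0) - pts.min(axis=0)  # axis-aligned bounding-box size
    s = float(extent.max())                     # assumption: largest side as scale
    R = random_rotation(rng)                    # rotation is random, as in the notes
    return R, t, s

# Toy scene: a flat surface at depth 2 whose x coordinate equals the column index.
rng = np.random.default_rng(0)
H, W = 4, 4
pointmap = np.zeros((H, W, 3))
pointmap[..., 0] = np.arange(W)[None, :]
pointmap[..., 2] = 2.0
mask = np.zeros((H, W), dtype=bool)
mask[1:3, 1:3] = True
R, t, s = os_r_layout(pointmap, mask, rng)
```

OS-A would replace the random rotation and the heuristic translation/scale with the human-annotated pose.
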
- Training
  - Pretrain on ISO-3DO.
  - Mid-train on RP-3DO (FO and OS-R, plus OS-A for texture) to gain robustness and layout skills in natural scenes.
  - SFT on MITL-3DO, then on Art-3DO.
  - Preference optimization (DPO) using human choices.
  - Brief distillation for fast inference.
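
The stage order above can be written down as a small schedule. This is only a bookkeeping sketch: `Stage`, `PIPELINE`, and `run_pipeline` are hypothetical names, the dataset labels are shorthand for the entries in these notes, and the distillation data is left empty because the notes do not state it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    data: tuple  # dataset labels taken from the notes

PIPELINE = (
    Stage("pretrain", ("ISO-3DO",)),
    Stage("mid-train", ("RP-3DO/FO", "RP-3DO/OS-R", "RP-3DO/OS-A")),
    Stage("sft-mitl", ("MITL-3DO",)),
    Stage("sft-art", ("Art-3DO",)),
    Stage("dpo", ("MITL-3DO preferences",)),
    Stage("distill", ()),  # data not stated in the notes
)

def run_pipeline(stages, train_fn, state=None):
    """Apply train_fn stage by stage, threading the model state through."""
    for stage in stages:
        state = train_fn(state, stage)
    return state

# Stand-in training function that only records the order of the stages.
visited = []

def fake_train(state, stage):
    visited.append(stage.name)
    return (state or 0) + 1

final = run_pipeline(PIPELINE, fake_train)
print(visited, final)
```

Each stage starts from the checkpoint left by the previous one, which is why the state is threaded through rather than reset.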

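For the preference-optimization step, the standard DPO objective is a useful reference point. The sketch below is the generic form from the language-model setting, applied to log-likelihoods of a preferred and a rejected sample; the variant the paper uses for its generative model may differ, and `beta=0.1` is an arbitrary placeholder.

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Generic DPO loss for batches of (preferred w, rejected l) pairs.

    logp_*: log-likelihoods under the policy being trained.
    ref_logp_*: log-likelihoods under the frozen reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) equals log(1 + exp(-margin)), computed stably.
    return float(np.mean(np.logaddexp(0.0, -margin)))

zeros = np.zeros(1)
# The policy equals the reference, so the margin is zero and the loss is log(2).
loss_neutral = dpo_loss(zeros, zeros, zeros, zeros)
# The policy has moved toward the preferred sample, so the loss drops.
loss_better = dpo_loss(np.array([-1.0]), np.array([-3.0]),
                       np.array([-2.0]), np.array([-2.0]))
```

The loss decreases as the policy raises the preferred sample's likelihood relative to the reference by more than it raises the rejected one's.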