SAM 3D: 3Dfy Anything in Images
Xingyu Chen ⋅ Fu-Jen Chu ⋅ Pierre Gleize ⋅ Kevin Liang ⋅ Alexander Sax ⋅ Hao Tang ⋅ Weiyao Wang ⋅ Michelle Guo ⋅ Thibaut Hardin ⋅ Xiang Li ⋅ Aohan Lin ⋅ Jia-Wei Liu ⋅ Ziqi Ma ⋅ Anushka Sagar ⋅ Bowen Song ⋅ Xiaodong Wang ⋅ Jianing "Jed" Yang ⋅ Bowen Zhang ⋅ Piotr Dollár ⋅ Georgia Gkioxari ⋅ Matt Feiszli ⋅ Jitendra Malik
Abstract
We present SAM 3D, a generative model for visually grounded 3D object reconstruction that predicts geometry, texture, and layout from a single image. SAM 3D excels in natural images, where occlusion and scene clutter are common and visual recognition cues from context play a larger role. We achieve this with a human- and model-in-the-loop pipeline for annotating object shape, texture, and pose, providing visually grounded 3D reconstruction data at unprecedented scale. We learn from this data in a modern, multi-stage training framework that combines synthetic pretraining with real-world alignment, breaking the 3D "data barrier". We obtain significant gains over recent work, with at least a 5:1 win rate in human preference tests on real-world objects and scenes. We will release our code and model weights, an online demo, and a new challenging benchmark for in-the-wild 3D object reconstruction.