Poster
Pippo: High-Resolution Multi-View Humans from a Single Image
Yash Kant · Ethan Weber · Jin Kyu Kim · Rawal Khirodkar · Zhaoen Su · Julieta Martinez · Igor Gilitschenski · Shunsuke Saito · Timur Bagautdinov
We present Pippo, a generative model capable of producing a dense set of high-resolution (1K) multi-view images of a person from a single photo. Our approach requires neither parametric model fitting nor camera parameters for the input image, and it generalizes to arbitrary identities with diverse clothing and hairstyles. Pippo is a multi-view diffusion transformer trained in multiple stages. First, we pretrain the model on a billion-scale human-centric image dataset. Second, we train it on studio data to generate many low-resolution, consistent views conditioned on coarse cameras and an input image. Finally, we fine-tune it on high-resolution data for multi-view generation with minimal placement controls, further improving consistency. This training strategy retains the generalizability gained from large-scale pretraining while enabling high-resolution multi-view synthesis. We investigate several key architectural design choices for multi-view generation with diffusion transformers to achieve precise view and identity control. Using a newly introduced 3D consistency metric, we demonstrate that Pippo outperforms existing approaches on multi-view human generation from a single image.
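To make the conditioning setup in the abstract concrete, below is a minimal, purely illustrative sketch (not the authors' implementation) of how a multi-view diffusion transformer might jointly denoise several target views from a single reference image and coarse per-view cameras. All module names, tensor sizes, and the camera encoding are assumptions for illustration only.

```python
# Illustrative sketch only: not Pippo's actual architecture or code.
import torch
import torch.nn as nn

class MultiViewDiT(nn.Module):
    def __init__(self, dim=256, depth=4, heads=4, cam_dim=16):
        super().__init__()
        self.cam_proj = nn.Linear(cam_dim, dim)    # coarse per-view camera -> token bias
        self.time_proj = nn.Linear(1, dim)         # diffusion timestep embedding
        self.token_proj = nn.Linear(dim, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.out = nn.Linear(dim, dim)

    def forward(self, noisy_views, ref_tokens, cams, t):
        # noisy_views: (B, V, T, dim) noisy latent tokens for V target views
        # ref_tokens:  (B, T, dim)    tokens from the single input photo
        # cams:        (B, V, cam_dim) coarse per-view camera parameters
        # t:           (B, 1)          diffusion timestep
        B, V, T, D = noisy_views.shape
        cam_emb = self.cam_proj(cams).unsqueeze(2)          # (B, V, 1, dim)
        x = self.token_proj(noisy_views) + cam_emb          # tag tokens with their view
        x = x.reshape(B, V * T, D)                          # joint attention across views
        cond = ref_tokens + self.time_proj(t).unsqueeze(1)  # image + timestep conditioning
        x = self.blocks(torch.cat([cond, x], dim=1))[:, T:] # attend to reference tokens
        return self.out(x).reshape(B, V, T, D)              # per-view noise prediction

# Usage: jointly denoise 8 views conditioned on one reference image's tokens.
model = MultiViewDiT()
noise_pred = model(torch.randn(2, 8, 64, 256),  # noisy latents for 8 target views
                   torch.randn(2, 64, 256),     # reference-image tokens
                   torch.randn(2, 8, 16),       # coarse cameras
                   torch.rand(2, 1))            # timesteps
```

The key design point this sketch reflects is that all target views share one attention context, so consistency across views is encouraged by the architecture rather than enforced afterwards; the staged training schedule described in the abstract (large-scale pretraining, low-resolution studio training, high-resolution fine-tuning) would wrap around a model of this shape.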