Poster
One Diffusion to Generate Them All
Duong H. Le · Tuan Pham · Sangho Lee · Christopher Clark · Aniruddha Kembhavi · Stephan Mandt · Ranjay Krishna · Jiasen Lu
We introduce \texttt{OneDiffusion} - a single large-scale diffusion model designed to tackle a wide range of image synthesis and understanding tasks. It can generate images conditioned on text, depth, pose, layout, or semantic maps, and it also handles super-resolution, multi-view generation, instant personalization, and text-guided image editing. Additionally, the same model is effective for image understanding tasks such as depth estimation, pose estimation, open-vocabulary segmentation, and camera pose estimation. Our unified model is trained with a simple but effective recipe: we cast all tasks as modeling a sequence of frames and inject a different noise scale into each frame during training. During inference, any frame can be set to a clean image, enabling tasks such as conditional image generation, camera pose estimation, or generating paired depth maps and images from a caption. Experimental results show that our method achieves competitive performance on a wide range of tasks. By eliminating the need for specialized architectures or finetuning, OneDiffusion provides enhanced flexibility and scalability, reflecting the emerging trend towards general-purpose diffusion models in the field.
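The per-frame noise-scale idea can be illustrated with a minimal sketch. The tensor layout, the linear (flow-matching style) interpolation, and the function name below are illustrative assumptions, not the authors' exact formulation.

```python
# Minimal sketch (not the authors' code): every task is a sequence of
# "frames" (images / conditioning maps), and each frame receives its own
# independently sampled noise level during training.
import torch

def add_per_frame_noise(frames: torch.Tensor):
    """frames: (B, N, C, H, W) sequence of views/modalities for one sample."""
    b, n = frames.shape[:2]
    # Independent noise scale per frame, not per sample.
    t = torch.rand(b, n, device=frames.device)       # in [0, 1)
    noise = torch.randn_like(frames)
    t_ = t.view(b, n, 1, 1, 1)
    # Linear interpolation between clean frame and noise (flow-matching style);
    # a DDPM-style alpha/sigma schedule would work analogously.
    noisy = (1.0 - t_) * frames + t_ * noise
    return noisy, t, noise

# At inference, conditioning frames are kept clean (t = 0) while the target
# frames start from pure noise (t = 1) and are denoised as usual.
```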