

Instruct-Imagen: Image Generation with Multi-modal Instruction

Hexiang Hu · Kelvin C.K. Chan · Yu-Chuan Su · Wenhu Chen · Yandong Li · Kihyuk Sohn · Yang Zhao · Xue Ben · William Cohen · Ming-Wei Chang · Xuhui Jia

Arch 4A-E Poster #221
Wed 19 Jun 5 p.m. PDT — 6:30 p.m. PDT
Oral presentation: Orals 2A Image & Video Synthesis
Wed 19 Jun 1 p.m. PDT — 2:30 p.m. PDT


This paper presents Instruct-Imagen, a model that tackles heterogeneous image generation tasks and generalizes across unseen tasks. We introduce multi-modal instruction for image generation, a task representation that articulates a range of generation intents with precision. It uses natural language to amalgamate disparate modalities (e.g., text, edge, style, subject), so that abundant generation intents can be standardized in a uniform format. We then build Instruct-Imagen by fine-tuning a pre-trained text-to-image diffusion model in two stages. First, we adapt the model using retrieval-augmented training to enhance its ability to ground its generation on external multi-modal context. Subsequently, we fine-tune the adapted model on diverse image generation tasks that require vision-language understanding (e.g., subject-driven generation), each paired with a multi-modal instruction encapsulating the task's essence. Human evaluation on various image generation datasets reveals that Instruct-Imagen matches or surpasses prior task-specific models in-domain and demonstrates promising generalization to unseen and more complex tasks. Our evaluation suite will be made publicly available.
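
As a rough illustration of what a uniform multi-modal instruction format could look like, the sketch below pairs a natural-language instruction with a list of tagged contexts. It is not the authors' implementation: the class names, the `[ref#N]` tag syntax, and the file names are assumptions made for this example only.

```python
import re
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Context:
    """One piece of multi-modal context (all fields are hypothetical)."""
    tag: str       # reference tag used inside the instruction text, e.g. "ref#1"
    modality: str  # e.g. "subject", "style", "edge"
    payload: str   # stand-in for the actual image data (here just a file name)


@dataclass
class MultiModalInstruction:
    """Natural-language instruction plus the contexts it refers to."""
    text: str
    contexts: list[Context] = field(default_factory=list)

    def referenced_tags(self) -> list[str]:
        # Collect tags such as "ref#1" in the order they appear in the text.
        return re.findall(r"\[(ref#\d+)\]", self.text)

    def validate(self) -> None:
        # Every tag mentioned in the text must resolve to a supplied context.
        known = {c.tag for c in self.contexts}
        missing = [t for t in self.referenced_tags() if t not in known]
        if missing:
            raise ValueError(f"unresolved references: {missing}")


# Assumed example: a subject-driven and a style-driven intent in one instruction.
instruction = MultiModalInstruction(
    text="Generate an image of [ref#1] in the style of [ref#2].",
    contexts=[
        Context("ref#1", "subject", "dog.png"),
        Context("ref#2", "style", "watercolor.png"),
    ],
)
instruction.validate()
```

Under this assumed layout, different generation intents differ only in the instruction text and the contexts attached to it, which is the sense in which the format is uniform.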
