DuetGen: Towards General-Purpose Interleaved Multimodal Generation
Abstract
Unified multimodal generation aims to jointly model image-to-text and text-to-image tasks within a single architecture. However, current approaches struggle to produce coherent, interleaved sequences of text and images. This limitation hinders applications that rely on tightly integrated multimodal outputs—such as step-by-step instructional guides, visual planning tools, and interactive content editing—where textual explanations and visual elements must be generated in a coordinated manner. We introduce DuetGen, a general-purpose interleaved multimodal generation model, and systematically investigate data curation, architecture design, and evaluation. On the data side, we construct a large-scale, high-quality instruction-tuning corpus that combines curated web content, rewritten multimodal conversations, and diverse synthetic examples covering everyday scenarios. Architecturally, DuetGen builds upon a pretrained MLLM and a diffusion transformer (DiT) pretrained for video generation, avoiding costly unimodal pretraining while remaining scalable. A two-stage decoupled training strategy first instruction-tunes the MLLM and then aligns it with the DiT using large-scale curated interleaved image–text sequences. Experiments on public and newly constructed benchmarks show that DuetGen substantially outperforms prior open-source systems in text quality, image fidelity, and image–context alignment, with notable gains on text-to-image and image-editing benchmarks. Code and data will be released.