DuetGen: Towards General-Purpose Interleaved Multimodal Generation
Abstract
Unified multimodal generation aims to jointly model image-to-text and text-to-image tasks within a single architecture. However, current approaches struggle to produce coherent, interleaved sequences of text and images. This limitation hinders applications that rely on tightly integrated multimodal outputs—such as step-by-step instructional guides, visual planning tools, and interactive content editing—where textual explanations and visual elements must be generated in a coordinated manner. We introduce DuetGen, a general-purpose interleaved multimodal generation model, and systematically investigate data curation, architecture design, and evaluation. On the data side, we construct a large-scale, high-quality instruction-tuning corpus that combines curated web content, rewritten multimodal conversations, and diverse synthetic examples covering everyday scenarios. Architecturally, DuetGen builds upon a pretrained MLLM and a diffusion transformer (DiT) pretrained for video generation, avoiding costly unimodal pretraining while remaining scalable. A two-stage decoupled training strategy first instruction-tunes the MLLM and then aligns it with the DiT using large-scale curated interleaved image–text sequences. Experiments on public and newly constructed benchmarks show that DuetGen substantially outperforms prior open-source systems in text quality, image fidelity, and image–context alignment, with notable gains on text-to-image and image-editing benchmarks. Code and data will be released.