Weaver: Decoupled Training for Interleaved Multi-modal Generation
Abstract
Recent unified multi-modal models have made unprecedented progress in understanding and generation, yet they largely support multi-modal inputs with single-modality outputs and struggle to produce complex interleaved text–image content due to data scarcity and the difficulty of modeling long-range cross-modal context. We introduce Weaver, which frames interleaved generation as an autoregressive planning–visualization process within a unified multi-modal architecture. A planner, i.e., the understanding expert, digests rich text–image context to produce, in addition to plain text, visualization triggers and their dense textual guidance, while a visualizer, i.e., the generation expert, produces images conditioned on the planner's textual guidance and visual references. This design enables decoupled learning: we train the two experts in parallel on large collections of textual planning data and reference-guided image data, yielding powerful interleaved multi-modal generation at inference. Moreover, training the planner on datasets from diverse understanding and generation tasks equips the model with automatic task inference. To analyze and evaluate the model along multiple dimensions, we further introduce a benchmark that covers a range of everyday use cases. Extensive experiments show that, even with no or only very limited training on real interleaved data, Weaver achieves superior performance on interleaved multi-modal generation.
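
The following is a minimal sketch, not the authors' implementation, of the decoupled planner–visualizer training described above. The module names, dimensions, dummy data, and the idea of representing images as discrete tokens are illustrative assumptions; only the overall structure (two experts trained in parallel on separate textual-planning and reference-guided image data) follows the abstract.

```python
# Illustrative sketch of decoupled planner/visualizer training.
# All names, sizes, and data here are hypothetical placeholders.
import torch
import torch.nn as nn

VOCAB, DIM, IMG_TOKENS = 1000, 256, 64  # assumed shared token vocabulary

class Planner(nn.Module):
    """Understanding expert: autoregressively emits plain text plus
    visualization triggers followed by dense textual guidance."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True), num_layers=2)
        self.lm_head = nn.Linear(DIM, VOCAB)

    def forward(self, token_ids):                      # (B, T) -> (B, T, VOCAB)
        return self.lm_head(self.backbone(self.embed(token_ids)))

class Visualizer(nn.Module):
    """Generation expert: predicts image tokens conditioned on the planner's
    textual guidance and visual-reference tokens."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True), num_layers=2)
        self.img_head = nn.Linear(DIM, VOCAB)

    def forward(self, guidance_ids, reference_ids):
        cond = torch.cat([guidance_ids, reference_ids], dim=1)
        h = self.backbone(self.embed(cond))
        return self.img_head(h[:, -IMG_TOKENS:])       # predict image tokens

planner, visualizer = Planner(), Visualizer()
opt_p = torch.optim.AdamW(planner.parameters(), lr=1e-4)
opt_v = torch.optim.AdamW(visualizer.parameters(), lr=1e-4)
ce = nn.CrossEntropyLoss()

# Decoupled training: each expert sees its own data stream,
# with no real interleaved text-image sequences required.
text_batch = torch.randint(0, VOCAB, (2, 32))          # textual planning data
guid_batch = torch.randint(0, VOCAB, (2, 16))          # dense textual guidance
ref_batch  = torch.randint(0, VOCAB, (2, IMG_TOKENS))  # reference image tokens
img_batch  = torch.randint(0, VOCAB, (2, IMG_TOKENS))  # target image tokens

# Planner step: next-token prediction over planning text
# (plain text, triggers, and guidance are all ordinary tokens here).
logits = planner(text_batch[:, :-1])
loss_p = ce(logits.reshape(-1, VOCAB), text_batch[:, 1:].reshape(-1))
loss_p.backward(); opt_p.step(); opt_p.zero_grad()

# Visualizer step: reference-guided image-token prediction.
img_logits = visualizer(guid_batch, ref_batch)
loss_v = ce(img_logits.reshape(-1, VOCAB), img_batch.reshape(-1))
loss_v.backward(); opt_v.step(); opt_v.zero_grad()
print(float(loss_p), float(loss_v))
```

At inference, under this sketch, the planner would decode until it emits a visualization trigger, at which point its accumulated textual guidance (and any visual references) is handed to the visualizer to produce an image before planning resumes.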