WEAVE: Unleashing and Benchmarking the In-context Interleaved Comprehension and Generation
Wei Chow ⋅ Jiachun Pan ⋅ Yongyuan Liang ⋅ Mingze Zhou ⋅ Xue Song ⋅ Liyu Jia ⋅ Saining Zhang ⋅ Siliang Tang ⋅ Juncheng Li ⋅ Fengda Zhang ⋅ Weijia Wu ⋅ Hanwang Zhang ⋅ Tat-seng Chua
Abstract
Recent advances in unified multimodal models (UMMs) have enabled impressive progress in visual comprehension and generation. However, existing datasets and benchmarks focus primarily on single-turn interactions, failing to capture the multi-turn, context-dependent nature of real-world image creation and editing. To address this gap, we present WEAVE, the first suite for in-context interleaved cross-modality comprehension and generation. Our suite consists of two complementary parts. WEAVE-100k is a large-scale dataset of $100$K interleaved samples spanning over $370$K dialogue turns and $500$K images, covering comprehension, editing, and generation tasks that require reasoning over historical context. WEAVEBench is a human-annotated benchmark of $100$ tasks built on $480$ images, featuring a hybrid VLM-judger evaluation framework that scores outputs against both the reference image and the combination of the original image with the editing instructions, assessing models' abilities in multi-turn generation, visual memory, and world-knowledge reasoning across diverse domains. Experiments demonstrate that training on WEAVE-100k improves vision comprehension, image editing, and comprehension-generation collaboration, and enables UMMs to develop emergent visual-memory capabilities, while extensive evaluations on WEAVEBench expose the persistent limitations and challenges of current approaches to multi-turn, context-aware image generation and editing. We believe WEAVE provides a perspective and foundation for studying in-context interleaved comprehension and generation in the multimodal community.
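As a rough illustration of the hybrid VLM-judger evaluation described above, the sketch below scores a model's output against both the annotated reference image and the (original image, editing instruction) pair, then combines the two judgments. All names and the weighting scheme here (e.g. `vlm_judge`, `alpha`) are hypothetical placeholders for exposition, not the actual WEAVEBench implementation.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    source_image: str     # path to the image being edited in this turn
    instruction: str      # editing/generation instruction for this turn
    reference_image: str  # human-annotated ground-truth result


def vlm_judge(prompt: str, images: list[str]) -> float:
    """Hypothetical wrapper around a vision-language-model judge.

    Returns a score in [0, 1]. In practice this would call an actual VLM
    with the prompt and images; here it is a stub placeholder.
    """
    return 0.0  # placeholder score


def hybrid_score(output_image: str, turn: Turn, alpha: float = 0.5) -> float:
    """Combine reference-based and instruction-based judgments (assumed equal weighting)."""
    # Judgment 1: how closely does the output match the reference image?
    ref_score = vlm_judge(
        "Rate how well the candidate image matches the reference result.",
        [output_image, turn.reference_image],
    )
    # Judgment 2: was the instruction correctly applied to the original image?
    instr_score = vlm_judge(
        f"Rate whether the edit '{turn.instruction}' was correctly applied to the source image.",
        [turn.source_image, output_image],
    )
    return alpha * ref_score + (1 - alpha) * instr_score
```

In a multi-turn setting, `hybrid_score` would be applied per dialogue turn and aggregated over the conversation, so that errors in visual memory across turns are reflected in the final score.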