UniVerse: Empowering Unified Generation with Reasoning and Knowledge
Abstract
Current text-to-image (T2I) generation models often struggle with prompts that require complex reasoning or specialized knowledge, failing to accurately interpret implicit user intent. To bridge this gap, we introduce \textbf{T2I-Reason}, a large-scale dataset designed to empower text-to-image generation in unified multimodal models (UMMs) with reasoning and knowledge. The dataset contains 120k samples, each pairing a text triplet with an image. The text triplet consists of (1) an implicit prompt, whose underlying meaning must be deciphered through reasoning or knowledge; (2) a reasoning chain, a step-by-step analysis that resolves the implicit prompt's meaning; and (3) an explicit prompt, a clear and straightforward visual description ready for T2I generation. T2I-Reason is meticulously constructed: 65k samples are dedicated to reasoning, specifically targeting arithmetic reasoning, spatial-attribute relationship reasoning, deductive reasoning (cause to effect), and abductive reasoning (effect to cause); the remaining 55k samples require specialized knowledge, covering multiple disciplines, spatial-temporal concepts, and entity knowledge. To validate the effectiveness of our dataset, we train a unified multimodal model, Bagel, on T2I-Reason. Across multiple benchmarks that evaluate the reasoning capabilities of T2I generation, our model achieves significant and consistent improvements in both composition and reasoning, confirming that explicit training on intermediate reasoning chains is a pivotal step toward more intelligent unified generative models.
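To make the triplet structure concrete, the sketch below shows one way a single T2I-Reason sample could be laid out. This is a minimal illustration under stated assumptions: the field names, example content, and file path are hypothetical and do not reflect the dataset's released schema.

\begin{verbatim}
# Hypothetical sketch of one T2I-Reason record (Python). Field names and
# example content are illustrative assumptions, not the actual schema.
sample = {
    # (1) Implicit prompt: intent must be inferred via reasoning/knowledge.
    "implicit_prompt": "The animal famous for building dams, resting by a river.",
    # (2) Reasoning chain: step-by-step resolution of the implicit intent.
    "reasoning_chain": "Dams in nature are built by beavers, so the animal "
                       "is a beaver; depict it resting on a riverbank.",
    # (3) Explicit prompt: a direct visual description ready for T2I generation.
    "explicit_prompt": "A beaver resting on the bank of a river.",
    # Paired target image for supervision (path is a placeholder).
    "image_path": "images/000001.png",
}
\end{verbatim}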