Learning to Generate via Understanding: Understanding-Driven Intrinsic Rewarding for Unified Multimodal Models
Abstract
Recently, unified multimodal models (UMMs) have made remarkable progress in integrating visual understanding and generation, demonstrating strong potential for complex text-to-image (T2I) tasks. Despite this promise, a persistent capability gap remains: UMMs typically exhibit strong visual understanding but comparatively weak generative ability. This discrepancy arises largely from the intrinsic decoupling between the understanding and generation processes. While a UMM can accurately interpret fine-grained visual details, it often struggles to produce semantically coherent images from complex textual prompts. To address this challenge, we exploit a UMM's internal understanding capability to enhance its generation quality. We propose GvU, a token-level intrinsic text-image alignment reward mechanism that lets the UMM act simultaneously as teacher and student: the model evaluates its own outputs with its understanding branch and uses these evaluations to guide generation. Building on this mechanism, we design a self-supervised reinforcement learning framework that allows UMMs to iteratively improve their generation quality through understanding-based intrinsic reward signals, without relying on external supervision. Experimental results show that our method substantially improves UMMs' generation quality, which in turn strengthens their fine-grained visual understanding, narrowing the gap between UMMs' visual understanding and generation capabilities.
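To make the abstract's core idea concrete, the following is a minimal, illustrative sketch of one understanding-driven intrinsic reward step. The `UnifiedModel` interface, with its `generate` and `score_alignment` methods, is a hypothetical stand-in for a real UMM's two branches; it is not the paper's actual GvU implementation, and the dummy bodies exist only to make the sketch runnable.

```python
# Illustrative sketch of an understanding-driven intrinsic reward loop.
# Assumption: the UMM exposes a generation branch and an understanding
# branch; both are stubbed with dummy logic here.

import random


class UnifiedModel:
    """Hypothetical stand-in for a unified multimodal model (UMM)."""

    def generate(self, prompt: str) -> list[int]:
        # Generation branch: emit a sequence of image tokens for the
        # prompt (dummy random tokens in this sketch).
        return [random.randrange(1024) for _ in range(16)]

    def score_alignment(self, prompt: str, image_tokens: list[int]) -> float:
        # Understanding branch: return an alignment score for the
        # (prompt, image) pair, e.g. the probability of answering "yes"
        # to "Does this image match the prompt?" (dummy value here).
        return random.random()


def intrinsic_reward_step(model: UnifiedModel, prompt: str) -> float:
    """One self-rewarding step: the model scores its own generation."""
    image_tokens = model.generate(prompt)
    reward = model.score_alignment(prompt, image_tokens)
    # In the framework described by the abstract, this reward would
    # drive a reinforcement-learning update of the generation branch,
    # with no external reward model or human supervision involved.
    return reward


if __name__ == "__main__":
    model = UnifiedModel()
    print(intrinsic_reward_step(model, "a red cube on a blue table"))
```

The key design point the sketch highlights is that teacher and student are the same network: the reward signal is intrinsic, computed by the model's own understanding branch rather than by an external judge.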