UVU: Improving Multimodal Understanding via a Vision-Language Unified Autoregressive Paradigm
Abstract
Despite remarkable advancements in multimodal large language models (MLLMs), their fine-grained visual understanding is constrained by reliance on purely textual supervision. To unify understanding and generation, unified autoregressive multimodal models introduce visual supervision; however, visual feature discretization and the near-orthogonality between image and text loss gradients tend to impair multimodal understanding. In this paper, we observe that pixel-level image patches and textual tokens both reside in raw high-dimensional spaces and exhibit an inherent input symmetry. Motivated by this insight, we propose UVU, a novel vision-language unified autoregressive framework that eschews vector quantization. UVU employs continuous visual encoding to represent visual inputs without quantization loss, and introduces a large-scale iterative hierarchical clustering algorithm to construct a pixel-level visual codebook, thereby extending the vocabulary for unified supervision and enabling autoregressive generation of pixel-level image tokens alongside textual tokens. UVU effectively synergizes pixel-level visual perception with semantic-level visual understanding, internalizes visual generation capability, and, for the first time, unlocks the facilitative role of visual supervision in enhancing understanding. Extensive experiments across multiple tasks demonstrate that MLLMs trained under the UVU supervised learning paradigm achieve superior multimodal understanding performance.
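To make the codebook-construction idea concrete, the sketch below illustrates one plausible reading of "iterative hierarchical clustering" over continuous patch features: a coarse partition of the feature space followed by per-cluster refinement, with the resulting centroids serving as pixel-level codebook entries. The abstract does not specify the actual algorithm; the function name build_pixel_codebook, the two-level scheme, the use of scikit-learn's MiniBatchKMeans, and all sizes are assumptions made purely for exposition.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Illustrative sketch only: the paper's codebook-construction procedure is not
# given in the abstract. The two-level (coarse -> fine) clustering, the
# cluster counts, and MiniBatchKMeans are assumptions for exposition.
def build_pixel_codebook(patch_features: np.ndarray,
                         num_coarse: int = 1024,
                         num_fine_per_coarse: int = 16,
                         seed: int = 0) -> np.ndarray:
    """Cluster continuous patch embeddings into a fixed visual codebook.

    patch_features: (N, D) array of pixel-level patch embeddings.
    Returns a (<= num_coarse * num_fine_per_coarse, D) array of codebook entries.
    """
    # Level 1: coarse partition of the feature space. Mini-batch k-means
    # scales to large N because it fits on streamed batches.
    coarse = MiniBatchKMeans(n_clusters=num_coarse, random_state=seed)
    coarse_ids = coarse.fit_predict(patch_features)

    codebook = []
    # Level 2: refine each coarse cluster into finer codes; iterating per
    # cluster keeps memory bounded even for very large feature corpora.
    for c in range(num_coarse):
        members = patch_features[coarse_ids == c]
        if len(members) == 0:
            continue  # skip empty clusters (rare with k-means++ init)
        k = min(num_fine_per_coarse, len(members))
        fine = MiniBatchKMeans(n_clusters=k, random_state=seed)
        fine.fit(members)
        codebook.append(fine.cluster_centers_)
    return np.concatenate(codebook, axis=0)
```

Under this reading, each continuous patch feature would be mapped to the index of its nearest codebook entry only for the supervision target, while the model's input remains the continuous encoding, which is consistent with the abstract's claim of eschewing vector quantization at the input.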