UVU: Improving Multimodal Understanding via a Vision-Language Unified Autoregressive Paradigm
Abstract
Despite remarkable advancements in multimodal large language models (MLLMs), their fine-grained visual understanding is constrained by reliance on purely textual supervision. To unify understanding and generation, unified autoregressive multimodal models introduce visual supervision; however, visual feature discretization and the near-orthogonality between image and text loss gradients tend to impair multimodal understanding. In this paper, we observe that pixel-level image patches and textual tokens both reside in raw high-dimensional spaces and exhibit an inherent input symmetry. Motivated by this insight, we propose UVU, a novel vision-language unified autoregressive framework that eschews vector quantization. UVU employs continuous visual encoding to represent visual inputs without quantization loss, and introduces a large-scale iterative hierarchical clustering algorithm to construct a pixel-level visual codebook, thereby extending the vocabulary for unified supervision and enabling autoregressive generation of pixel-level image tokens alongside textual tokens. UVU effectively synergizes pixel-level visual perception with semantic-level visual understanding, internalizes visual generation capability, and, for the first time, unlocks the facilitative role of visual supervision in enhancing understanding. Extensive experiments across multiple tasks demonstrate that MLLMs trained under the UVU supervised learning paradigm achieve superior multimodal understanding performance.
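To make the codebook-construction idea concrete, the sketch below illustrates one plausible reading of "iterative hierarchical clustering" over continuous patch features: a coarse partition of the feature space followed by per-cluster refinement, with the resulting centroids serving as pixel-level codebook entries. The abstract does not specify the actual algorithm; the function name build_pixel_codebook, the two-level scheme, the use of scikit-learn's MiniBatchKMeans, and all sizes are assumptions made purely for exposition.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Illustrative sketch only: the paper's codebook-construction procedure is not
# given in the abstract. The two-level (coarse -> fine) clustering, the
# cluster counts, and MiniBatchKMeans are assumptions for exposition.
def build_pixel_codebook(patch_features: np.ndarray,
                         num_coarse: int = 1024,
                         num_fine_per_coarse: int = 16,
                         seed: int = 0) -> np.ndarray:
    """Cluster continuous patch embeddings into a fixed visual codebook.

    patch_features: (N, D) array of pixel-level patch embeddings.
    Returns a (<= num_coarse * num_fine_per_coarse, D) array of codebook entries.
    """
    # Level 1: coarse partition of the feature space. Mini-batch k-means
    # scales to large N because it fits on streamed batches.
    coarse = MiniBatchKMeans(n_clusters=num_coarse, random_state=seed)
    coarse_ids = coarse.fit_predict(patch_features)

    codebook = []
    # Level 2: refine each coarse cluster into finer codes; iterating per
    # cluster keeps memory bounded even for very large feature corpora.
    for c in range(num_coarse):
        members = patch_features[coarse_ids == c]
        if len(members) == 0:
            continue  # skip empty clusters (rare with k-means++ init)
        k = min(num_fine_per_coarse, len(members))
        fine = MiniBatchKMeans(n_clusters=k, random_state=seed)
        fine.fit(members)
        codebook.append(fine.cluster_centers_)
    return np.concatenate(codebook, axis=0)
```

Under this reading, each continuous patch feature would be mapped to the index of its nearest codebook entry only for the supervision target, while the model's input remains the continuous encoding, which is consistent with the abstract's claim of eschewing vector quantization at the input.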