Rosetta Stone For Unified MLLMs: A Unified Tokenizer to Decipher Understanding and Generation
Wenyu Sun ⋅ Hufei Li ⋅ Ruijin Jin ⋅ Xiangheng Kong ⋅ Yuning Jiang
Abstract
Major state-of-the-art unified tokenizers predominantly adopt pixel reconstruction and feature alignment as pretext tasks, leaving key areas such as architecture, supervision objectives, and task interaction largely unexplored, which can limit performance. We systematically investigate the critical factors of a unified visual tokenizer and propose a novel framework that strengthens the synergy between understanding and generation along several axes. Our initial analysis focuses on the properties of frontier vision models, confirming an inherent conflict in contrastive-learning-style models when unifying generation and understanding, and demonstrating distinct convergence behaviors of their codebooks. To address this bottleneck, we hierarchically decouple the conflicting proxy tasks and enrich the diversity of semantic feature supervision, enhancing both semantic and low-level capabilities. We further introduce an attention-prioritized mapping strategy that guides fine-grained generation with a powerful semantic prior. Our method achieves an rFID of 0.33 and zero-shot accuracy of 80.9\% on ImageNet at 256$\times$256 resolution, surpassing VILA-U by 7.6\% and outperforming the continuous embeddings of SigLIP. When applied to discrete unified MLLMs, our 7B model exceeds TokenFlow-13B by 3.1\% in understanding and achieves SOTA performance on GenAI-Bench and MJHQ-30K.