Prompt Yourself: Awakening Textual Semantics in 1D Visual Tokenizers
Abstract
One-dimensional (1D) visual tokenizers offer notable semantic compactness by discarding local spatial priors, and have become increasingly popular for image reconstruction and generation tasks. However, such global and sequential representations struggle to preserve fine-grained visual content; simply increasing network size or token count offers only superficial mitigation. To address this, we introduce \textit{\textbf{VLTok}}, a novel 1D hybrid tokenizer that unifies \textit{\textbf{V}}isual and \textit{\textbf{L}}anguage representations in a shared \textit{\textbf{Tok}}en space through a \textit{\textbf{self-prompted}} training paradigm. During training, VLTok simultaneously generates 1D visual and textual tokens from images, aligning the textual tokens with embeddings from a pre-trained language model. This cross-modal alignment infuses implicit linguistic cues into the tokenizer, enhancing fine-grained image encoding. At inference, the self-prompted paradigm eliminates the need for external text, maintaining the simplicity of the image-only framework while benefiting from multi-modal guidance. Extensive experiments on the ImageNet benchmark demonstrate that VLTok achieves state-of-the-art performance in both image reconstruction and image generation. For example, under the same model parameter budget, our method yields relative reductions of 11.1\% in rFID and 18.7\% in gFID compared to GigaTok.
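To make the self-prompted training paradigm concrete, the following is a minimal sketch of one plausible form of the joint objective: a pixel reconstruction loss on the 1D visual tokens plus a cosine-similarity alignment between the generated textual tokens and frozen embeddings from a pre-trained language model. All names, shapes, and the specific loss form (\texttt{self\_prompted\_loss}, \texttt{tokenizer}, \texttt{decoder}, \texttt{lm\_embed}, the loss weights) are assumptions for illustration and may differ from the paper's actual implementation.

\begin{verbatim}
# A minimal sketch of the self-prompted training objective, assuming a
# tokenizer that emits both visual and textual 1D tokens from the image
# alone, and precomputed frozen language-model embeddings of the caption.
# All names, shapes, and the cosine-alignment loss are hypothetical.
import torch
import torch.nn.functional as F

def self_prompted_loss(image, tokenizer, decoder, lm_embed,
                       recon_weight=1.0, align_weight=0.5):
    # The 1D tokenizer produces visual and textual tokens from the image,
    # e.g. shapes (B, Nv, D) and (B, Nt, D).
    visual_tokens, textual_tokens = tokenizer(image)

    # Reconstruction branch: decode the 1D visual tokens back to pixels.
    recon = decoder(visual_tokens)
    recon_loss = F.mse_loss(recon, image)

    # Alignment branch: pull the textual tokens toward frozen LM embeddings
    # (detached, so no gradients flow into the language model).
    align_loss = 1.0 - F.cosine_similarity(
        textual_tokens, lm_embed.detach(), dim=-1
    ).mean()

    return recon_weight * recon_loss + align_weight * align_loss
\end{verbatim}

Under this sketch, the alignment branch is active only during training; at inference the tokenizer runs on the image alone, which is how the self-prompted design avoids any dependence on external text.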