SAMTok: Representing Any Mask with Two Words
Abstract
Pixel-wise capabilities are essential for building interactive intelligent systems. However, pixel-wise multi-modal LLMs (MLLMs) remain difficult to scale due to complex region-level encoders, specialized segmentation decoders, and incompatible training objectives. To address these challenges, we present SAMTok, a discrete mask tokenizer that converts any region mask into two textual special tokens and reconstructs masks from these tokens with high fidelity. By treating masks as a new language, SAMTok enables base MLLMs (such as the QwenVL series) to learn pixel-wise capabilities through standard next-token prediction and simple reinforcement learning, without architectural modifications or specialized loss designs. SAMTok builds on SAM2 and is trained on 209M diverse masks, using a mask encoder and a residual vector quantizer to produce discrete, compact, and information-rich tokens. With 5M SAMTok-formatted mask-understanding and mask-generation samples, QwenVL-SAMTok attains state-of-the-art or comparable results on region captioning, region VQA, grounded conversation, referring segmentation, scene graph parsing, and multi-round interactive segmentation. We further introduce a textual answer-matching reward that enables efficient reinforcement learning for mask generation, delivering substantial improvements on the GRES and GCG benchmarks. Our results demonstrate a simple and scalable paradigm for equipping MLLMs with strong pixel-wise capabilities. Code and models will be released.
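To make the "two tokens per mask" idea concrete, the sketch below shows a generic two-stage residual vector quantizer: stage one quantizes a continuous mask embedding against a codebook, stage two quantizes the leftover residual, and the two resulting indices can be rendered as two special tokens in text. All names, dimensions, and codebook sizes here are illustrative assumptions, not SAMTok's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a 256-d mask embedding and two codebooks of 1024 entries each.
DIM, CODEBOOK_SIZE, STAGES = 256, 1024, 2
codebooks = [rng.normal(size=(CODEBOOK_SIZE, DIM)) for _ in range(STAGES)]

def rvq_encode(embedding, codebooks):
    """Residual VQ: each stage quantizes what the previous stages left over."""
    residual = embedding
    token_ids = []
    for cb in codebooks:
        # Pick the codebook entry nearest (in L2) to the current residual.
        idx = int(np.argmin(((cb - residual) ** 2).sum(axis=1)))
        token_ids.append(idx)
        residual = residual - cb[idx]
    return token_ids

def rvq_decode(token_ids, codebooks):
    """Reconstruct the embedding by summing the selected codebook entries."""
    return sum(cb[i] for cb, i in zip(codebooks, token_ids))

# A stand-in for a mask embedding produced by a mask encoder.
emb = rng.normal(size=DIM)
ids = rvq_encode(emb, codebooks)
recon = rvq_decode(ids, codebooks)

# Render the two indices as two special tokens (hypothetical token format).
mask_as_text = f"<mask_a_{ids[0]}><mask_b_{ids[1]}>"
print(ids, mask_as_text)
```

A real tokenizer would learn the codebooks jointly with the encoder and decoder, but the interface is the same: any mask becomes exactly two discrete indices, which an MLLM can emit via ordinary next-token prediction.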