What Do Visual Tokens Really Encode? Uncovering Sparsity and Redundancy in Multimodal Large Language Models
Yingqi Fan ⋅ Junlong Tong ⋅ Anhao Zhao ⋅ Xiaoyu Shen
Abstract
Multimodal LLMs (MLLMs) convert images into visual tokens for language-model processing, yet how these tokens encode semantics remains unclear. In this paper, we identify a consistent token structure across models: visual tokens cluster into sink, dead, and alive groups, with only the alive tokens ($\approx 60\%$) carrying meaningful information. Sink and dead tokens can be removed without hurting performance. Using a patch-compression benchmark and our probing tool *EmbedLens*, we show that alive tokens already encode fine-grained cues (objects, colors, OCR) before entering the LLM. Internal visual computation (visual attention and FFNs) is largely redundant and offers limited benefit for most tasks. This redundancy also extends to the model's depth: our analysis shows that alive tokens align best with mid-layer LLM representations, while shallow layers contribute little. These findings provide a unified view of visual semantics in MLLMs and motivate architectures that use fewer visual tokens, reduced visual computation, and mid-layer injection for better efficiency and interpretability.
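To make the sink/dead/alive grouping concrete, below is a minimal sketch of one plausible way to partition visual tokens and prune the non-alive ones before they reach the LLM. It is not the paper's actual procedure: the attention statistic, the quantile cutoff for sink tokens, and the near-zero threshold for dead tokens are all illustrative assumptions.

```python
import torch


def group_visual_tokens(attn_received: torch.Tensor,
                        sink_quantile: float = 0.99,
                        dead_threshold: float = 1e-4):
    """Partition visual tokens into sink / dead / alive groups.

    attn_received: (num_visual_tokens,) average attention mass each visual
        token receives, e.g. averaged over heads, layers, and query positions.
    The thresholds here are illustrative heuristics, not the paper's criteria.
    """
    sink_cut = torch.quantile(attn_received, sink_quantile)
    sink_mask = attn_received >= sink_cut        # a few tokens hoarding attention
    dead_mask = attn_received <= dead_threshold  # tokens that are essentially never attended to
    alive_mask = ~(sink_mask | dead_mask)        # remaining tokens carrying semantic content
    return sink_mask, dead_mask, alive_mask


def prune_visual_tokens(visual_tokens: torch.Tensor,
                        alive_mask: torch.Tensor) -> torch.Tensor:
    """Keep only alive visual tokens before feeding them to the LLM."""
    return visual_tokens[alive_mask]


# Example: 576 visual tokens (a 24x24 patch grid) with dummy statistics.
attn = torch.rand(576)
visual_tokens = torch.randn(576, 4096)
_, _, alive = group_visual_tokens(attn)
pruned = prune_visual_tokens(visual_tokens, alive)
```

The design choice in this sketch is simply that sink tokens are detected as outliers in received attention and dead tokens as tokens receiving negligible attention; any comparable statistic (e.g., activation norms) could be substituted.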