Token Warping Helps MLLMs Look from Nearby Viewpoints
Abstract
Can warping tokens, rather than pixels, help multimodal large language models (MLLMs) understand how a scene appears from nearby viewpoints? While MLLMs perform well on single-image reasoning, they remain fragile to viewpoint changes because pixel-level warping is highly sensitive to small depth errors and often introduces geometric distortions. Drawing on theories of mental imagery that posit part-level structural representations as the basis for human perspective transformation, we examine whether image tokens in ViT-based MLLMs serve as an effective substrate for viewpoint warping. We compare two token-level transformation strategies, forward and backward warping, and find that backward token fetching, which iterates over target-view grid locations and retrieves their counterparts from the source view, achieves greater stability and better preserves semantic coherence under viewpoint shifts. Experiments on our proposed ViewBench benchmark demonstrate that token-level warping enables MLLMs to reason reliably from nearby viewpoints, consistently outperforming all baselines, including pixel-warping approaches, MLLMs fine-tuned for spatial reasoning, and a generative warping method.
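To make the backward-fetching idea concrete, the following is a minimal sketch (not the paper's implementation) of warping a ViT token grid between nearby views. It assumes a pinhole camera model with token-grid intrinsics `K`, a known per-token depth map at the target view, a relative pose `(R, t)` mapping target coordinates to source coordinates, and nearest-neighbor fetching; all of these names and simplifications are illustrative assumptions.

```python
import numpy as np

def backward_warp_tokens(src_tokens, depth, K, R, t):
    """Backward token fetching (illustrative sketch, not the paper's code).

    For each target-view grid location: unproject using the target-view
    depth, transform the 3D point into the source camera frame, project
    it back to the token grid, and fetch the nearest source token.

    src_tokens: (H, W, C) ViT token grid from the source view
    depth:      (H, W) depth assumed at each target grid location
    K:          (3, 3) intrinsics expressed in token-grid units
    R, t:       rotation (3, 3) and translation (3,), target -> source
    """
    H, W, C = src_tokens.shape
    out = np.zeros_like(src_tokens)
    valid = np.zeros((H, W), dtype=bool)
    K_inv = np.linalg.inv(K)
    for v in range(H):
        for u in range(W):
            # Unproject the center of the target grid cell to 3D.
            p = depth[v, u] * (K_inv @ np.array([u + 0.5, v + 0.5, 1.0]))
            # Move the point into the source frame and project it.
            q = K @ (R @ p + t)
            if q[2] <= 0:  # behind the source camera
                continue
            u_s, v_s = int(q[0] / q[2]), int(q[1] / q[2])
            if 0 <= u_s < W and 0 <= v_s < H:
                out[v, u] = src_tokens[v_s, u_s]  # fetch source token
                valid[v, u] = True
    return out, valid
```

Because every target cell receives exactly one fetched token (or is marked invalid), this direction avoids the holes and many-to-one collisions that forward scattering produces, which is one intuition for its greater stability.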