Merge3D: Efficient 3D Multimodal LLMs via Joint 2D-3D Token Merging
Tianbo Pan ⋅ Xingyi Yang ⋅ Xinchao Wang
Abstract
Multimodal Large Language Models (MLLMs) that incorporate 3D geometry show strong capability in 3D scene understanding. Their primary bottleneck, however, is the substantial computational cost of processing long, multi-view visual token sequences. Conventional 2D compression methods, which rely solely on semantic signals, prove inadequate for 3D tasks: they tend to discard spatially critical tokens and degrade grounding performance. To address this, we propose \textbf{Merge3D}, a geometry-aware token merging framework that integrates 3D geometry with 2D semantic information. Merge3D bridges the two modalities with a Semantic–Geometric Token Merger (SemGeo Merger): 2D attention selects semantically salient dominant tokens, while a hybrid 2D+3D similarity assigns and aggregates contextual tokens from spatially coherent 3D neighborhoods. This preserves 3D structural priors and inter-frame correspondences even under aggressive compression. Merge3D achieves up to 70\% visual token reduction and up to $\sim$3$\times$ inference speedup while retaining strong performance on 3D grounding, captioning, and spatial reasoning benchmarks such as Scan2Cap, CV-Bench, and BLINK.
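The abstract's merging scheme can be illustrated with a minimal sketch. This is not the paper's implementation: the function `semgeo_merge`, the top-$k$ attention selection, the exponential distance kernel, and the mixing weight `alpha` are all illustrative assumptions; the sketch only shows the general pattern of picking dominant tokens by 2D attention, scoring contextual tokens with a combined 2D-feature and 3D-distance similarity, and averaging each contextual token into its best-matching dominant token.

```python
import numpy as np

def semgeo_merge(feats, coords, attn, keep_ratio=0.3, alpha=0.5):
    """Hypothetical sketch of semantic-geometric token merging.

    feats:  (N, D) 2D visual token features
    coords: (N, 3) back-projected 3D token positions
    attn:   (N,)   per-token attention salience scores
    """
    n = feats.shape[0]
    k = max(1, int(n * keep_ratio))

    # 1) Dominant tokens: top-k by 2D attention salience.
    dom = np.argsort(attn)[-k:]
    ctx = np.setdiff1d(np.arange(n), dom)

    # 2) Hybrid 2D+3D similarity between contextual and dominant tokens.
    f = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    sem_sim = f[ctx] @ f[dom].T                        # cosine similarity
    dist = np.linalg.norm(coords[ctx, None] - coords[None, dom], axis=-1)
    geo_sim = np.exp(-dist)                            # nearer in 3D -> higher
    sim = alpha * sem_sim + (1 - alpha) * geo_sim

    # 3) Assign each contextual token to its best dominant token; average each group.
    assign = sim.argmax(axis=1)
    merged = feats[dom].copy()
    for j in range(k):
        members = ctx[assign == j]
        if members.size:
            group = np.concatenate([feats[dom[j]][None], feats[members]])
            merged[j] = group.mean(axis=0)
    return merged, coords[dom]
```

Under this sketch, `keep_ratio=0.3` corresponds to the roughly 70\% token reduction quoted in the abstract; the geometric term keeps merges inside spatially coherent 3D neighborhoods rather than collapsing distant but visually similar tokens.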