MLLMSplat: A 2D MLLM-Powered Framework for 3D Gaussian Splatting Understanding, Generation, and Editing
Abstract
3D Gaussian Splatting (3DGS) has emerged as a mainstream representation for 3D scenes, drawing increasing research attention to its understanding, generation, and editing. However, existing studies remain limited to low-level perception, low-quality generation, and low-efficiency editing, leaving 3DGS far behind its 2D image counterpart in the era of Multimodal Large Language Models (MLLMs). To bridge this gap, we propose MLLMSplat, a novel framework that adapts 2D MLLMs to achieve high-level understanding, high-quality generation, and high-efficiency editing of 3DGS scenes. Specifically, our framework consists of three core designs: (1) a 3DGS tokenizer that can be seamlessly integrated into existing MLLMs in a training-free manner; (2) a 3DGS de-tokenizer that non-intrusively extends the 2D latent diffusion model in MLLMs with a dual positional encoding space, while augmenting it with a jointly trained and sampled 3DGS decoder; and (3) a surrogate task that enhances feedforward editing capabilities. Extensive experiments demonstrate that MLLMSplat delivers state-of-the-art performance across 3DGS understanding, generation, and editing.