Localizing, Structuring, and Rendering: Bridging 3D and 2D Vision-Language-Action Models for Robotic Manipulation
Abstract
Robotic manipulation in complex 3D environments requires unifying spatial reasoning with intuitive visual perception, a capability that current Vision-Language-Action (VLA) paradigms address only in isolation. While 3D VLAs excel at geometric and physical reasoning, they lack intuitive, image-level understanding and dense visual semantics; conversely, 2D VLAs (even when augmented with depth images) provide rich visual intuition and semantic continuity but lack explicit global spatial grounding. We introduce DiffRender-VLA, a differentiable rendering–based framework that bridges 3D and 2D Vision-Language-Action models through gradient-consistent visual mediation. It generates differentiable images by localizing the next end-effector target with a world-aligned cube marker, differentiably structuring the surrounding geometry so that color encodes each element's spatial relation to the marker, and rendering adaptive viewpoints optimized to reveal target–environment spatial relationships. These differentiable images serve as visual bridges: they embed spatial semantics while allowing gradients from 2D VLAs to backpropagate into the 3D representation, thereby coupling geometric reasoning with visual perception. This closed differentiable loop unifies reasoning and perception, substantially improving performance in occluded, cluttered, and spatially complex manipulation tasks and achieving an average improvement of +12.1% over state-of-the-art methods.
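The core mechanism the abstract describes is a differentiable image that lets a 2D model's loss gradient flow back into a 3D scene representation. The sketch below is a minimal, illustrative toy of that gradient path, not the paper's implementation: a soft Gaussian splatting renderer (here called SoftRenderer) projects 3D points to an image differentiably, a small stand-in network (TinyVLA) plays the role of the pretrained 2D VLA, and backpropagation through the rendered image reaches the 3D point coordinates. All module names, shapes, and hyperparameters are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SoftRenderer(nn.Module):
    """Splat 3D points onto an HxW image with soft Gaussian footprints,
    so the image is differentiable w.r.t. the 3D point coordinates.
    (Illustrative stand-in for the paper's differentiable renderer.)"""
    def __init__(self, h=64, w=64, sigma=0.05):
        super().__init__()
        ys = torch.linspace(-1, 1, h)
        xs = torch.linspace(-1, 1, w)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        self.register_buffer("grid", torch.stack([gx, gy], dim=-1))  # (H, W, 2)
        self.sigma = sigma

    def forward(self, points):                # points: (N, 3); z used as intensity
        xy = points[:, :2]                    # toy orthographic projection
        d2 = ((self.grid.unsqueeze(2) - xy) ** 2).sum(-1)          # (H, W, N)
        weights = torch.exp(-d2 / (2 * self.sigma ** 2))           # soft footprints
        img = (weights * points[:, 2]).sum(-1, keepdim=True)       # (H, W, 1)
        return img.permute(2, 0, 1)                                # (1, H, W)

class TinyVLA(nn.Module):
    """Stand-in for a pretrained 2D vision-language-action head
    mapping an image to an action vector."""
    def __init__(self, action_dim=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, action_dim),
        )

    def forward(self, img):                   # img: (1, H, W)
        return self.net(img.unsqueeze(0)).squeeze(0)

# Toy 3D scene representation whose coordinates we want gradients for.
points = torch.randn(32, 3, requires_grad=True)
renderer, vla = SoftRenderer(), TinyVLA()

image = renderer(points)                      # differentiable "visual bridge"
action = vla(image)
loss = (action - torch.zeros_like(action)).pow(2).mean()  # dummy action target
loss.backward()
print(points.grad.shape)                      # gradients reached the 3D points
```

The key property shown is the closed differentiable loop: because the rendered image is a smooth function of the 3D coordinates, the 2D network's supervision signal can refine the 3D representation end to end.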