DiffuView: Multi-View Diffusion Pretraining for 3D-Aware Robotic Manipulation
Abstract
Robotic manipulation from visual observations remains challenging due to the lack of 3D-consistent representations that generalize across diverse viewpoints and sensor configurations. Existing approaches often rely on masked autoencoders or neural scene representations, which fail to capture cross-view correspondences. Meanwhile, multi-view diffusion models have recently shown tremendous success in 3D-aware generative synthesis, and their representations offer a promising direction for viewpoint-robust visuomotor control. In this paper, we introduce DiffuView, a novel framework that learns unified 3D-aware representations through multi-view diffusion pretraining and deploys them for imitation learning. Specifically, DiffuView models the conditional generation of target views given source observations within a diffusion framework, enabling the network to implicitly recover scene geometry and enforce view consistency. The pretrained diffusion network is then used as a visual backbone for an action policy, enabling robust control under varying viewpoints and visual conditions. We evaluate DiffuView on two challenging benchmarks, MetaWorld and LIBERO. Extensive experiments in both simulation and real-world scenarios demonstrate that DiffuView achieves superior generalization, improving success rates under viewpoint shifts by nearly 20\% compared with existing methods.
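For concreteness, the conditional view-generation pretraining described above can be written in a standard denoising-diffusion form. The following is only a sketch under the usual $\epsilon$-prediction parameterization; the symbols $x^{\mathrm{tgt}}$ (target view), $x^{\mathrm{src}}$ (source observations), and $\pi$ (relative camera pose) are illustrative and are not defined in the abstract itself:
\[
\mathcal{L}_{\mathrm{diff}}
= \mathbb{E}_{x^{\mathrm{tgt}},\, x^{\mathrm{src}},\, \pi,\, \epsilon \sim \mathcal{N}(0, I),\, t}
\left[ \left\| \epsilon - \epsilon_\theta\!\left(x_t^{\mathrm{tgt}},\, t,\, x^{\mathrm{src}},\, \pi\right) \right\|_2^2 \right],
\qquad
x_t^{\mathrm{tgt}} = \sqrt{\bar{\alpha}_t}\, x^{\mathrm{tgt}} + \sqrt{1-\bar{\alpha}_t}\, \epsilon ,
\]
where $\epsilon_\theta$ is the denoising network later reused as the policy's visual backbone and $\bar{\alpha}_t$ follows the chosen noise schedule.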