LaRP: Efficient Multi-View Inpainting with Latent Reprojection Priors
Abstract
Multi-view inpainting requires the inpainted content to be 3D-consistent across views. Most prior methods first perform single-view 2D inpainting and then enforce multi-view consistency in a post-hoc 3D optimization stage, which introduces undesirable artifacts and lengthy optimization times. The existing single-stage method, MVInpainter, relies on video priors and is pose-free, making it less suitable for inputs beyond video sequences. In this paper, we propose a framework that trains an inpainting model to condition on explicit and reliable multi-view correspondences from a 3D foundation model. Central to our framework is a cross-view conditioning architecture, LaRP, carefully designed to exploit both the generative prior of a pretrained diffusion inpainting model and reprojected cross-view appearance latents. We additionally propose a scalable data pipeline that enables stable training of LaRP. Extensive experiments demonstrate that LaRP outperforms prior methods in 3D consistency, achieves novel view synthesis quality competitive with the state of the art, and is ∼50× faster.
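To make the core idea of latent reprojection concrete, the following is a minimal PyTorch sketch, not the paper's implementation: it assumes a hypothetical `reproject_latents` helper and that cross-view correspondences (e.g., derived from a 3D foundation model's depth and pose estimates) are given as a per-pixel sampling grid with a validity mask; the warped source-view latents would then serve as the conditioning signal for the pretrained diffusion inpainting model.

```python
import torch
import torch.nn.functional as F

def reproject_latents(src_latent, correspondence, valid_mask):
    """Warp source-view latents into the target view (illustrative sketch).

    src_latent:     (B, C, H, W) latent features of a source view.
    correspondence: (B, H, W, 2) for each target-view latent location, the matching
                    source-view coordinate in normalized [-1, 1] (x, y) convention,
                    e.g. obtained from a 3D foundation model's depth/pose estimates.
    valid_mask:     (B, 1, H, W) 1 where a reliable correspondence exists, else 0.
    """
    warped = F.grid_sample(src_latent, correspondence, mode="bilinear",
                           padding_mode="zeros", align_corners=False)
    return warped * valid_mask

# Toy usage: one source view, a 64x64 latent grid with 4 channels.
B, C, H, W = 1, 4, 64, 64
src_latent = torch.randn(B, C, H, W)
# Identity correspondence as a stand-in for a real reprojection map.
ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
correspondence = torch.stack([xs, ys], dim=-1).unsqueeze(0)
valid_mask = torch.ones(B, 1, H, W)

cond = reproject_latents(src_latent, correspondence, valid_mask)
# `cond` would be fed as an extra conditioning input (e.g., concatenated
# channel-wise) alongside the masked target-view latents.
print(cond.shape)  # torch.Size([1, 4, 64, 64])
```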