Learning 3D Reconstruction with Priors at Test Time
Abstract
We introduce a test-time framework for multi-view Transformers (MVTs) that incorporates priors (e.g., camera poses, intrinsics, and depth) to improve performance on 3D tasks, without retraining or modifying the pre-trained image-only networks. Rather than feeding priors into the architecture, we cast them as constraints on the predictions and optimize the network at inference time. The optimization loss combines a self-supervised objective with prior penalty terms. The self-supervised objective measures the compatibility among multi-view predictions, implemented as a photometric or geometric loss between each view and the renderings produced from the other views. Any available priors are converted into penalty terms on the corresponding output modalities. Across a range of 3D vision benchmarks, including point map estimation and camera pose estimation, our method consistently improves performance over base MVTs by a large margin. On the ETH3D, 7-Scenes, and NRGBD datasets, it cuts the point map distance error by more than half relative to the base image-only models. Our method also outperforms retrained prior-aware feed-forward methods, demonstrating the effectiveness of our test-time constrained optimization (TCO) framework for incorporating priors in 3D vision tasks.
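To make the loss composition concrete, the following is a minimal sketch of the test-time objective as described above; the notation ($\mathcal{R}_{j \to i}$, $\Phi_p$, $\lambda_p$) is illustrative and assumed here, not taken from the paper:

% Total test-time loss: multi-view compatibility (photometric or
% geometric loss between each view and renderings from other views)
% plus one penalty term per available prior p (pose, intrinsics, depth).
% All symbols are illustrative, not the paper's own notation.
\begin{equation}
\mathcal{L}(\theta) =
\underbrace{\sum_{i \ne j}
  \mathcal{L}_{\mathrm{photo/geo}}\!\big(I_i,\; \mathcal{R}_{j \to i}(\hat{y}_\theta)\big)}_{\text{multi-view compatibility}}
\;+\;
\underbrace{\sum_{p \in \mathcal{P}}
  \lambda_p \,\big\| \Phi_p(\hat{y}_\theta) - y_p^{\mathrm{prior}} \big\|}_{\text{prior penalties}}
\end{equation}

Here $\hat{y}_\theta$ denotes the network's multi-view predictions, $\mathcal{R}_{j \to i}$ renders view $j$'s prediction into view $i$, $\Phi_p$ extracts the output modality matching prior $p$ (e.g., camera pose or depth), $\lambda_p$ weights each penalty, and the network parameters $\theta$ are optimized at inference.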