Registration-Free Learnable Multi-View Capture of Faces in Dense Semantic Correspondence
Abstract
Recent learning-based face reconstruction and registration frameworks such as ToFu and TEMPEH have shown that dense correspondence between facial scans and a common topology can be learned directly from images. However, these approaches still depend on precomputed registrations obtained through iterative optimization pipelines that often require manual verification and correction by human annotators. We introduce MOCHI (Multi-view Optimizable Correspondence of Heads from Images), a fully differentiable and registration-free alternative. Instead of relying on optimization-based registrations, we employ a pseudo-linear inverse kinematic solver, together with dense 2D keypoints produced by a tracker trained only on synthetic data, to directly enforce a common face topology at the vertex level. We further find that the commonly used point-to-surface distance can lead to unstable training and artifacts, and instead use pointmap- and normal-based losses that provide smoother gradients, more stable optimization, and improved reconstruction results. Additionally, at inference we introduce a brief test-time optimization scheme that can further refine the network's output, yielding registrations that outperform traditional labor-intensive pipelines. Despite removing external registrations, our extensive experiments show that MOCHI surpasses the previous state of the art in reconstruction accuracy and visual fidelity. The code and the model will be made public.
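To make the loss substitution concrete, the following is a minimal NumPy sketch of what pointmap- and normal-based supervision could look like for per-vertex predictions. The function names, the L1/cosine formulation, and the (V, 3) array layout are illustrative assumptions, not the paper's exact definitions; unlike a point-to-surface distance, neither term requires a nearest-surface search, which is one reason such losses yield smoother gradients.

```python
import numpy as np

def pointmap_loss(pred_pts, gt_pts):
    """Dense per-vertex L1 distance between predicted and ground-truth
    pointmaps of shape (V, 3). No closest-point search is needed because
    both arrays share the same common topology (hypothetical formulation)."""
    return float(np.mean(np.abs(pred_pts - gt_pts)))

def normal_loss(pred_normals, gt_normals, eps=1e-8):
    """1 - cosine similarity between per-vertex normals of shape (V, 3),
    averaged over vertices (hypothetical formulation)."""
    pn = pred_normals / (np.linalg.norm(pred_normals, axis=-1, keepdims=True) + eps)
    gn = gt_normals / (np.linalg.norm(gt_normals, axis=-1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(pn * gn, axis=-1)))
```

Both terms are zero when prediction and target coincide and are everywhere differentiable in the predicted quantities, consistent with the stability argument made in the abstract.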