SMVRT: Implicit Human 3D Modeling Using Sparse Multi-view Volumetric Reconstruction with Transformer Fusion
Abstract
Recently, the community has witnessed significant progress in human modeling from a single view or multiple views, which often involves "guessing" the occluded parts using either generative models or template fitting. In this work, we address these challenges by exploring optimal fusion strategies using sparse views only. We propose an end-to-end implicit 3D reconstruction framework built on a sparse multi-view setup. Specifically, we explore fusion blocks at three stages of the network. First, 2D feature encoders operating both locally and globally produce enhanced image features. Second, a 3D feature grid is formed by attentional fusion of warped multi-view, multi-level 2D features, followed by 3D regularization of the grid to aggregate spatially coherent multi-view features. Third, attentional 2D-3D feature aggregation at each query point generates an enhanced latent embedding, which is fed into an implicit field decoder for robust occupancy prediction. Evaluations on the THUman 2.1 and MultiGarment datasets demonstrate that our system significantly outperforms state-of-the-art methods both qualitatively and quantitatively.
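The core idea in the third stage, attention-weighted fusion of per-view features at a 3D query point before occupancy decoding, can be illustrated with a minimal sketch. This is not the paper's implementation; the function name, feature shapes, and scaled dot-product scoring are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attentional_view_fusion(view_feats, query_embed):
    """Fuse per-view features for one 3D query point via attention.

    view_feats  : (V, C) features sampled from V views at the point's
                  projected pixel locations (hypothetical inputs).
    query_embed : (C,) embedding of the query point itself.
    Returns a fused (C,) feature, ready for an occupancy decoder.
    """
    # Scaled dot-product scores between the query and each view.
    scores = view_feats @ query_embed / np.sqrt(view_feats.shape[1])  # (V,)
    weights = softmax(scores)      # one attention weight per view
    return weights @ view_feats    # attention-weighted sum over views

rng = np.random.default_rng(0)
V, C = 4, 16
fused = attentional_view_fusion(rng.normal(size=(V, C)),
                                rng.normal(size=(C,)))
print(fused.shape)  # (16,)
```

In a full system, such a fused embedding would be passed through an MLP decoder that predicts occupancy, with the attention weights letting reliable views dominate at occluded or oblique regions.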