FISHuman: Fine-grained Single-image 3D Human Reconstruction via Multi-view 4D Remeshing
Abstract
Single-image 3D human reconstruction holds significant promise due to its convenience and the high demand for it across applications. Previous methods have made substantial progress by employing 2D multi-view diffusion models to generate auxiliary views as reconstruction priors, but they struggle with 3D inconsistencies and limited generalization. In this paper, we present FISHuman, which generates fine-grained, high-fidelity, and content-diverse 3D humans from a single-view input, yielding production-ready 3D assets. We propose a carefully designed workflow that reconstructs dynamic 3D meshes from multi-view, mutually inconsistent guidance. Specifically, we adapt a dual-stream transformer-based video diffusion model to generate cross-modally aligned multi-view RGB and normal sequences. We find that naively applying static 3D reconstruction leads to geometric distortions and texture blurriness because the generated frames lack 3D awareness. To address this, we introduce a novel 4D remeshing module that explicitly disentangles the learning of a globally shared canonical mesh from transient variations by tracking per-vertex deformations across viewpoints. The topological consistency of the deformed meshes inherently enables the optimization of a unified UV representation that effectively integrates appearance attributes across frames. Both qualitative and quantitative results demonstrate the superiority of our method over prior work in appearance realism, geometric fineness, and generalization diversity. We also showcase the applicability of our reconstructed avatars to downstream tasks including animation and 3D editing.
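The disentanglement described above, a globally shared canonical mesh plus per-frame (per-viewpoint) vertex deformations over a fixed topology, can be illustrated with a minimal sketch. This is not the paper's implementation; the function name `fit_canonical`, the quadratic objective, and the regularization weight `lam` are all illustrative assumptions, and the closed-form solution holds only for this simple least-squares formulation.

```python
# Illustrative sketch (hypothetical, not the authors' method): recover a
# shared canonical mesh C and per-frame deformations D_t from per-frame
# vertex positions f_t that all share one topology, by minimizing
#   sum_t ||C + D_t - f_t||^2 + lam * sum_t ||D_t||^2.
import numpy as np

def fit_canonical(frames, lam=0.1):
    """frames: list of (V, 3) arrays, one per viewpoint/frame, same topology.
    Returns (C, D): canonical vertices (V, 3) and offsets (T, V, 3).
    For this quadratic objective the optimum is closed-form:
    C is the per-vertex mean over frames, and D_t = (f_t - C) / (1 + lam)."""
    F = np.stack(frames)            # (T, V, 3), valid because topology is shared
    C = F.mean(axis=0)              # canonical mesh: frame average per vertex
    D = (F - C) / (1.0 + lam)       # regularization shrinks transient offsets
    return C, D
```

Because every deformed mesh `C + D_t` reuses the same face connectivity, a single UV parameterization of `C` applies to all frames, which is what allows appearance attributes to be aggregated into one unified texture.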