Multi-view Consistent 3D Gaussian Head Avatars without Multi-view Generation
Abstract
Generating large-scale 3D head avatars of non-existent identities with high fidelity and strong multi-view consistency (MVC) is essential for applications such as synthetic crowds, digital twins, and large asset libraries. For high scalability, avatars must be generated from minimal resources, without costly multi-view studio captures or any 3D data. In this work, we target this challenging minimal-resource setting for 3D head generation. We further argue that the common strategy of enforcing MVC via intermediate multi-view image generation is both expensive and fundamentally fragile. Instead, we analyze how MVC can be induced by design, showing that intermediate view synthesis is unnecessary. To this end, we introduce MVCHead, a fast, single-shot state space model that directly predicts Gaussians without intermediate view generation. At its core, we propose a Hierarchical State Space (HiSS) block that enforces grid-aligned coherence while capturing long-range dependencies. We further modify Mamba's standard unidirectional scan into a Hierarchical Bi-directional State Scan (HiBiSS) that traverses the render grid to better propagate geometric and appearance cues. Finally, we design an SE(3) MV Critic that judges whether a set of self-renders arises from a single underlying 3D configuration, rewarding cross-view pixel alignment without requiring real multi-view data. In this setting, MVCHead surpasses state-of-the-art methods in perceptual quality and on all three MVC axes: shape, texture, and geometry. The code has been submitted and will be open-sourced with model weights upon acceptance.
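To make the scanning idea concrete, the sketch below illustrates what a hierarchical bi-directional scan over a 2D grid of features could look like. This is a hypothetical toy, not the authors' HiBiSS implementation: the function names, the simple decayed-sum recurrence (standing in for Mamba's selective state update), and the row-then-column hierarchy are all illustrative assumptions.

```python
# Hypothetical sketch of a hierarchical bi-directional state scan over a 2D
# grid of per-cell features. The decayed cumulative-sum recurrence is a toy
# stand-in for a Mamba-style selective state update.
import numpy as np

def bidirectional_scan(seq, decay=0.9):
    """Fuse a forward and a backward decayed cumulative-sum scan over a sequence."""
    fwd = np.zeros_like(seq)
    bwd = np.zeros_like(seq)
    state = np.zeros(seq.shape[1:])
    for i in range(len(seq)):                 # forward pass: left-to-right state
        state = decay * state + seq[i]
        fwd[i] = state
    state = np.zeros(seq.shape[1:])
    for i in reversed(range(len(seq))):       # backward pass: right-to-left state
        state = decay * state + seq[i]
        bwd[i] = state
    return fwd + bwd                          # each cell sees both directions

def hierarchical_bidirectional_scan(grid, decay=0.9):
    """Scan each row (local coherence), then each column (longer-range propagation)."""
    rows = np.stack([bidirectional_scan(r, decay) for r in grid])
    cols = np.stack([bidirectional_scan(c, decay)
                     for c in rows.transpose(1, 0, 2)])
    return cols.transpose(1, 0, 2)            # back to (H, W, C) layout

grid = np.random.randn(8, 8, 4)               # 8x8 render grid, 4-dim features
out = hierarchical_bidirectional_scan(grid)
print(out.shape)                              # (8, 8, 4)
```

After the two passes, every grid cell's output depends on every other cell, which is the long-range propagation property the abstract attributes to scanning the render grid bi-directionally.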