Archon: A Unified Multimodal Model for Holistic Digital Human Generation
Abstract
We introduce Archon, a unified multimodal framework that extends multimodal language models to address the fundamental challenge of holistic digital human generation. Archon unifies diverse human-centric modalities, including description, script, speech, animation, semantic segmentation, image, and video, within a single controllable generative system, enabled by modality-specific tokenization and auto-regressive cross-modal reasoning. For high-quality video outputs, we incorporate a semantic-driven video diffusion decoder that reconstructs photorealistic video from compact representations. We further analyze cross-modality ambiguity and explore alternative modality generation chains that improve controllability and coherence. Experiments demonstrate strong performance across diverse multimodal generation tasks without task-specific fine-tuning.