EchoVDiff: Cardiac-Cycle Echocardiography Video Generation from an Arbitrary Frame
Abstract
Reconstructing a physiologically plausible cardiac video from a single image remains a fundamental challenge in generative modeling, owing to the complex, nonlinear, and periodic dynamics of cardiac motion in echocardiography. Previous image-to-video (I2V) approaches primarily focus on temporal continuity, yet often fail to capture the intrinsic periodicity of cardiac motion, leading to limited temporal coherence and semantic consistency. We present EchoVDiff, a novel phase-aware diffusion model that reconstructs a full cardiac cycle from any single frame. Instead of direct pixel synthesis, EchoVDiff integrates physiological priors into the diffusion paradigm, learning interpretable mappings among cardiac phase, anatomy, and motion. By jointly modeling temporal rhythm and spatial semantics within a disentangled latent space, it achieves controllable and physiologically consistent generation. Extensive experiments on EchoNet-Dynamic and EchoNet-Pediatric demonstrate that EchoVDiff consistently surpasses state-of-the-art methods in both fidelity and temporal coherence. Remarkably, it accurately reconstructs complete cardiac cycles from frames at arbitrary phases, marking the first demonstration of single-frame-driven echocardiographic video generation.