Taming Video Models for 3D and 4D Generation via Zero-Shot Camera Control
Abstract
Video diffusion models have rich world priors, but their use in spatial tasks is limited by poor controllability, spatially and temporally inconsistent results, and entangled scene-camera dynamics. Current approaches, such as per-task fine-tuning or post-processing warping strategies, are insufficient, often introducing visual artifacts, failing to generalize, or incurring high computational costs. We introduce a novel, training-free framework that operates purely at inference time to resolve these issues. Our method comprises three synergistic components. First, an intra-step refinement loop injects fine-grained motion guidance during the denoising process, iteratively correcting the output to ensure strict adherence to the target camera path. Second, an optical flow-based analysis identifies and isolates motion-related channels within the latent space. This allows our framework to apply guidance selectively, thereby decoupling motion from appearance and preserving visual fidelity. Third, a dual-path guidance strategy adaptively corrects for drift by comparing the guided generation against an unguided, reference denoising path, effectively neutralizing artifacts caused by misaligned structural inputs. These components work in concert to inject precise, trajectory-aligned control without any model retraining, achieving both accurate motion guidance and photorealistic synthesis. Our plug-and-play, model-agnostic solution demonstrates broad applicability to 3D/4D tasks. Extensive experiments confirm state-of-the-art performance in trajectory adherence and perceptual quality, outperforming both training-dependent and other inference-only methods.