Are Image-to-Video Models Good Zero-Shot Image Editors?
Abstract
Large-scale video diffusion models exhibit strong world-simulation and temporal reasoning capabilities, yet their potential as zero-shot image editors remains underexplored. We present \ifedit{IF-Edit} (\textbf{I}mage Edit by Generating \textbf{F}rames), a tuning-free framework that repurposes pre-trained image-to-video diffusion models for instruction-driven image editing. \ifedit{IF-Edit} addresses three core obstacles—prompt misalignment, redundant temporal latents, and blurry late-stage frames—via: (1) a Chain-of-Thought Prompt Enhancement module that reformulates static editing instructions into temporally grounded reasoning prompts; (2) a Temporal Latent Dropout strategy that compresses frame latents after the expert-switch point, accelerating denoising while preserving global semantics and temporal coherence; and (3) a Self-Consistent Post-Refinement step that refines the sharpest late-stage frame through a brief still-video trajectory, leveraging the video prior to recover detail and improve faithfulness. Extensive experiments across four public benchmarks—covering non-rigid deformations, physical and temporal reasoning, and general instruction editing—show that \ifedit{IF-Edit} achieves strong performance on non-rigid and reasoning-centric tasks while remaining competitive on general-purpose edits. Our study offers a systematic view of video diffusion models as image editors, revealing their unique strengths and limitations and distilling a simple recipe for unified video–image generative reasoning.