CoT-Edit: Let Chain-of-Thought Guide Instruction-Based Video Editing
Abstract
Text-driven, instruction-based video editing in complex scenes remains challenging: purely textual prompts often fail to capture precise spatial relationships and physical constraints, leading to target ambiguity and physically implausible results. To address this, we propose a plan-guide-edit framework that explicitly bridges semantic intent and spatial execution. In our framework, a Chain-of-Thought (CoT)-enhanced multimodal large language model (MLLM) serves as a planner, performing structured reasoning over the video and the instruction to derive a precise sequence of bounding boxes together with attribute-enriched editing directives. These spatial priors then guide a box-conditioned mask generator, transforming ambiguous global retrieval into localized, context-aware refinement and producing masks that more accurately capture object scale, contact relationships, and placement. Building on these spatial and semantic signals, a diffusion-based editor integrates the masks, the enriched instructions, and frame features to render high-fidelity edits that remain temporally coherent and spatially well aligned. Trained first in a modular manner and then jointly, our framework achieves strong performance with reduced data requirements, delivering precise localization in scenes with multiple similar objects and physically consistent object additions. Extensive experiments demonstrate state-of-the-art results against multiple strong baseline methods.
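To make the three-stage decomposition concrete, the sketch below outlines the plan-guide-edit control flow described above. It is a minimal illustration, not the released implementation: all names (CoTPlanner-style arguments, Directive, and the stage functions) are hypothetical placeholders, and only the data handed between stages (boxes and directives, then masks, then edited frames) follows the abstract.

```python
from dataclasses import dataclass

# Hypothetical data type; field names are illustrative, not from the paper's code.
@dataclass
class Directive:
    target: str       # object to edit, e.g. "the red mug on the left"
    operation: str    # e.g. "add", "replace", "remove"
    attributes: dict  # attribute enrichment inferred by the CoT planner
    boxes: list       # per-frame bounding boxes [(x0, y0, x1, y1), ...]

def plan(mllm, frames, instruction):
    """Stage 1 (plan): the CoT-enhanced MLLM reasons over the video and the
    instruction, then emits attribute-enriched directives with per-frame boxes.
    Both method calls are placeholders for the planner's actual interface."""
    reasoning = mllm.chain_of_thought(frames, instruction)
    return mllm.extract_directives(reasoning)  # -> list[Directive]

def guide(mask_generator, frames, directives):
    """Stage 2 (guide): box-conditioned mask generation. Each box restricts
    the search region, turning global retrieval into local refinement."""
    return [mask_generator(frames, d.boxes, d.attributes) for d in directives]

def edit(editor, frames, directives, masks):
    """Stage 3 (edit): the diffusion-based editor fuses masks, enriched
    directives, and frame features into temporally coherent edited frames."""
    prompts = [d.attributes | {"op": d.operation, "target": d.target}
               for d in directives]
    return editor(frames, masks=masks, prompts=prompts)

def cot_edit(mllm, mask_generator, editor, frames, instruction):
    directives = plan(mllm, frames, instruction)
    masks = guide(mask_generator, frames, directives)
    return edit(editor, frames, directives, masks)
```

One consequence of this decomposition is that each stage exposes a clean interface, which is what allows the modular pretraining followed by joint fine-tuning mentioned in the abstract.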