RecEdit-Drive: 3D Reconstruction-Guided Spatiotemporal Video Editing for Autonomous Driving Scenes
Yipeng Wu ⋅ Xin Wang ⋅ Chenghan Yang ⋅ Chong Wang ⋅ Dongdong Wu ⋅ Wanchao Su ⋅ Hengshuang Zhao ⋅ Wei Feng ⋅ Kairui Yang ⋅ Di Lin
Abstract
High-quality video editing and processing are crucial in domains such as filmmaking and autonomous driving, where accurate visual refinement and data preparation are essential. However, it remains challenging to achieve precise control over dynamic objects while maintaining spatiotemporal consistency. Current approaches typically rely on text prompts or 2D structural priors to enforce consistency, yet they struggle to constrain the spatial variations of dynamic 3D objects. In this paper, we introduce $\textbf{RecEdit-Drive}$, a framework that integrates $\textbf{Spatial Feature Warping}$ and $\textbf{Spatiotemporal Collaborative Modeling}$ to control 3D object variations and enhance video consistency. Spatial feature warping enables precise control over the edited foreground 3D objects, improving the spatial consistency of the generated videos, while spatiotemporal collaborative modeling seamlessly integrates the edited foreground objects into the background, yielding realistic and consistent edited videos. In addition, we design an inference strategy that reconstructs an accurate background structure through noise manipulation, providing a reliable reference for foreground instance editing at the early denoising stages. Extensive qualitative and quantitative comparisons on public datasets, covering both general video editing and downstream tasks, demonstrate the state-of-the-art performance of the proposed method.
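For intuition, the generic idea behind warping features to follow object motion can be sketched as bilinear resampling of a feature map under a per-pixel displacement field. This is only an illustrative sketch of feature warping in general; the paper's actual Spatial Feature Warping module (and the 3D-reconstruction-derived correspondences it uses) is not specified in this abstract, so the function and its inputs below are hypothetical.

```python
import numpy as np

def warp_features(feat, flow):
    """Bilinearly sample a feature map at positions shifted by a flow field.

    feat: (H, W, C) feature map.
    flow: (H, W, 2) per-pixel (dx, dy) displacements, e.g. derived from
          projecting a 3D object's motion into the image plane (assumed here).
    Returns the warped (H, W, C) feature map; samples are clamped at borders.
    """
    H, W, _ = feat.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Target sampling positions, clipped to stay inside the feature map.
    x = np.clip(xs + flow[..., 0], 0, W - 1)
    y = np.clip(ys + flow[..., 1], 0, H - 1)
    # Integer corners surrounding each sampling position.
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.minimum(x0 + 1, W - 1), np.minimum(y0 + 1, H - 1)
    wx, wy = (x - x0)[..., None], (y - y0)[..., None]
    # Bilinear interpolation of the four neighboring feature vectors.
    top = feat[y0, x0] * (1 - wx) + feat[y0, x1] * wx
    bot = feat[y1, x0] * (1 - wx) + feat[y1, x1] * wx
    return top * (1 - wy) + bot * wy
```

With a zero flow field the function returns the input unchanged, which is a convenient sanity check when plugging a real displacement field in.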