Image Guides Images: Consistent Video Amodal Completion with Rectified In-Context Exemplar Guidance
Abstract
Video amodal completion (VAC) aims to mimic the human brain's ability to implicitly perceive the complete appearance of partially occluded objects, thereby facilitating recognition and understanding. Existing VAC methods finetune video generation models on custom datasets, yet because real amodal data is difficult to collect, these datasets are often small and have unrealistic distributions, which limits the performance and generalization of such methods. To address this, we utilize pre-trained image inpainting models for VAC and introduce in-context (IC) learning to enhance inter-frame consistency. However, although DiT-based IC learning performs satisfactorily in generation tasks, its task-agnostic use of global information often incorporates irrelevant scene content, resulting in completion failures when applied to the amodal completion task. Additionally, IC learning faces a cold-start problem in exemplar construction. To this end, we propose consistent video amodal completion with rectified in-context exemplar guidance. Specifically, we introduce rectified exemplar-guided completion, which adjusts the attention weights of the exemplar image relative to the target images for consistent completion, and adopt dual-frame calibrated exemplar rectification to tackle the cold-start issue. Quantitative and qualitative experiments demonstrate that our method outperforms state-of-the-art methods, especially in generalization and robustness on uncommon data and under severe occlusion.
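As a rough illustration of what "adjusting the attention weights of the exemplar image relative to the target images" could look like, the sketch below adds a log-scale bias to the attention logits of exemplar tokens before the softmax. This is only one plausible reading and not necessarily the rectification used in this work: the function name `reweighted_attention`, the `exemplar_mask` argument, and the scalar `scale` are all hypothetical and are introduced here purely for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def reweighted_attention(q, k, v, exemplar_mask, scale=2.0):
    """Single-head attention with exemplar keys up- or down-weighted.

    q: (Nq, d) target-frame queries
    k: (Nk, d) keys from exemplar and target tokens
    v: (Nk, dv) values aligned with k
    exemplar_mask: (Nk,) bool, True where a key belongs to the exemplar
    scale: hypothetical factor applied to the unnormalized weight of
           exemplar keys (scale=1.0 recovers standard attention)
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)
    # Adding log(scale) to a logit multiplies its pre-normalization
    # weight by `scale`, so the rows still sum to one after softmax.
    bias = np.where(exemplar_mask, np.log(scale), 0.0)
    weights = softmax(logits + bias, axis=-1)
    return weights @ v, weights
```

With `scale > 1`, each target query places a larger share of its attention mass on exemplar tokens; with `scale < 1`, the exemplar's influence is reduced. How the scale is chosen, and whether it varies per token or per layer, is left open in this sketch.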