Poster
MaskDiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation
Tianhao Qi · Jianlong Yuan · Wanquan Feng · Shancheng Fang · Jiawei Liu · SiYu Zhou · Qian HE · Hongtao Xie · Yongdong Zhang
Abstract:
Sora has unveiled the immense potential of the Diffusion Transformer (DiT) architecture in single-scene video generation. However, the more challenging task of multi-scene video generation, which offers broader applications, remains relatively underexplored. To address this gap, we introduce MaskDiT, a novel approach that ensures fine-grained, one-to-one alignment between video segments and their corresponding text annotations. Specifically, we insert a symmetric binary mask at each attention layer within the DiT architecture. This mask ensures that each text annotation is applied exclusively to its corresponding video segment, while preserving temporal coherence across all visual tokens. With this attention mask facilitating fine-grained, segment-level textual-to-visual alignment, we adapt the DiT architecture for video generation tasks involving a fixed number of scenes. To further equip the DiT architecture with the capability to generate videos with additional scenes, we incorporate a segment-level conditional mask that treats preceding video segments as context for the final segment, thereby enabling auto-regressive scene extension. Both qualitative and quantitative experiments confirm that MaskDiT excels in maintaining visual consistency across segments while ensuring semantic alignment between each segment and its corresponding text description.
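To make the dual masking concrete, below is a minimal PyTorch sketch of the two masks described in the abstract. It is illustrative only: the token layout (all per-scene text tokens followed by all per-scene visual tokens), the function names `build_segment_attention_mask` and `build_context_flags`, and how the conditional mask enters the noising step are assumptions of this sketch, not details taken from the paper.

```python
import torch


def build_segment_attention_mask(n_scenes: int, n_text: int, n_vis: int) -> torch.Tensor:
    """Symmetric binary attention mask over [all text tokens, all visual tokens].

    Assumed token layout: the first n_scenes * n_text positions hold the
    per-scene text annotations, followed by n_scenes * n_vis visual tokens.
    True = attention allowed.
    """
    total = n_scenes * (n_text + n_vis)
    mask = torch.zeros(total, total, dtype=torch.bool)

    vis_start = n_scenes * n_text

    # Visual tokens attend to all visual tokens across every segment,
    # preserving temporal coherence over the whole video.
    mask[vis_start:, vis_start:] = True

    for s in range(n_scenes):
        t = slice(s * n_text, (s + 1) * n_text)
        v = slice(vis_start + s * n_vis, vis_start + (s + 1) * n_vis)
        # Each annotation attends within itself ...
        mask[t, t] = True
        # ... and, symmetrically, only to/from its own segment's visual tokens.
        mask[t, v] = True
        mask[v, t] = True
    return mask


def build_context_flags(n_scenes: int, n_vis: int) -> torch.Tensor:
    """Segment-level conditional mask (hypothetical form): True marks visual
    tokens of preceding segments kept clean as context; only the final
    segment's tokens (False) would be noised and denoised, enabling
    auto-regressive scene extension.
    """
    flags = torch.ones(n_scenes * n_vis, dtype=torch.bool)
    flags[-n_vis:] = False  # last segment is the one being generated
    return flags
```

A boolean mask of this form can be passed directly as `attn_mask` to `torch.nn.functional.scaled_dot_product_attention`, where True marks positions allowed to participate in attention.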