Free-Lunch Long Video Generation via Layer-Adaptive OOD Correction
Abstract
Generating long videos using pre-trained video diffusion models, which are typically trained on short clips, presents a significant challenge. Directly applying these models to long-video inference often leads to notable degradation in visual quality. This paper identifies that this issue primarily stems from two out-of-distribution (OOD) problems: frame-level relative-position OOD and context-length OOD. To address these challenges, we propose a novel training-free, layer-adaptive framework. The core of our approach is the observation that different layers within the model exhibit varying sensitivities to these two OOD issues. We first introduce a systematic probing procedure to quantify each layer's sensitivity. Based on the results, we apply a tailored, layer-wise strategy. For layers sensitive to relative positions, we propose a novel multi-granularity video-based relative position re-encoding (VRPR) scheme. For layers sensitive to context length, we employ a tiered sparse attention (TSA) mechanism combined with an attention sink. Extensive experiments show that our method achieves state-of-the-art performance in long video generation. Importantly, our framework can be seamlessly integrated into various leading video diffusion models without any additional training.
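To make the layer-adaptive dispatch concrete, below is a minimal PyTorch sketch of how per-layer routing between the two corrections might look. This is an illustration under stated assumptions, not the paper's implementation: the threshold constants, `remap_frame_positions` (a crude linear stand-in for the multi-granularity VRPR), and `sink_sparse_attention` (a single-window stand-in for TSA) are all hypothetical.

```python
import torch

# Hypothetical constants: the clip length the base model was trained on and
# the probed sensitivity cutoffs. Actual values would come from the probing
# procedure described in the paper.
TRAIN_LEN = 16
POS_THRESH = 0.5   # relative-position sensitivity cutoff (assumed)
CTX_THRESH = 0.5   # context-length sensitivity cutoff (assumed)


def remap_frame_positions(pos_ids: torch.Tensor, target_len: int) -> torch.Tensor:
    """Compress frame indices of a long video back into the trained range.

    A crude linear stand-in for VRPR: rescales positions so their relative
    spread matches the training distribution, keeping relative/rotary
    position encodings in-distribution for position-sensitive layers.
    """
    return pos_ids.float() * (TRAIN_LEN - 1) / max(target_len - 1, 1)


def sink_sparse_attention(q, k, v, window: int = TRAIN_LEN, n_sink: int = 4):
    """Windowed attention that always attends to the first few "sink" frames.

    A single-window stand-in for tiered sparse attention + attention sink,
    capping the effective context length seen by context-sensitive layers.
    q, k, v: (batch, heads, frames, dim).
    """
    T = q.shape[2]
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    idx = torch.arange(T, device=q.device)
    # Keep keys within a local window of each query, plus the sink frames.
    keep = ((idx[None, :] - idx[:, None]).abs() <= window) | (idx[None, :] < n_sink)
    scores = scores.masked_fill(~keep, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v


def layer_strategy(pos_sens: float, ctx_sens: float) -> str:
    """Pick the correction for one layer from its probed sensitivities."""
    if pos_sens >= POS_THRESH:
        return "vrpr"      # re-encode relative positions for this layer
    if ctx_sens >= CTX_THRESH:
        return "tsa"       # cap effective context with sparse attention
    return "vanilla"       # layer is robust; leave it untouched
```

At inference, each attention layer would consult `layer_strategy` once (using its probed scores) and then apply the corresponding correction on every denoising step; since both corrections act only on position indices and attention masks, no weights are modified and no retraining is needed, which is the sense in which the framework is training-free.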