Poster
Recreating 1940s Tom and Jerry with Test-Time Training
Jiarui Xu · Shihao Han · Karan Dalal · Daniel Koceja · Yue Zhao · Ka Chun Cheung · Yejin Choi · Jan Kautz · Yu Sun · Xiaolong Wang
We present a framework for generating long-form cartoon videos, focusing on recreating the classic "Tom and Jerry" series. Recent advances in video generation have shown promising results for short clips, but generating long videos with coherent storylines and dynamic motion remains challenging and computationally expensive. We propose a hybrid framework that combines local self-attention with a global attention mechanism based on Test-Time Training (TTT), enabling the model to process significantly longer temporal context windows and stay consistent across them. We also develop a dataset curation pipeline designed for long-form cartoon videos, which combines human annotations of complex motion dynamics with detailed descriptions produced by Vision-Language Models. The pipeline captures the exaggerated movements and dynamic camera work characteristic of "Tom and Jerry". Experiments show that our approach outperforms existing methods at generating long-form animated content with plausible motion and consistent storylines.
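The hybrid design described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`local_self_attention`, `ttt_linear`, `hybrid_block`), the non-overlapping window scheme, the reconstruction loss, the learning rate, and the additive combination are all assumptions chosen for illustration. The sketch shows one common reading of the idea: softmax attention is confined to short windows of tokens, while a TTT-style layer carries global context in a weight matrix that is updated by one gradient step per token.

```python
import numpy as np

def local_self_attention(x, window):
    """Softmax self-attention restricted to non-overlapping windows (assumed scheme)."""
    n, d = x.shape
    out = np.zeros_like(x)
    for start in range(0, n, window):
        seg = x[start:start + window]
        scores = seg @ seg.T / np.sqrt(d)
        scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        out[start:start + window] = weights @ seg
    return out

def ttt_linear(x, lr=0.01):
    """Toy TTT-style layer: the hidden state is a weight matrix W that is
    trained at inference time on a self-supervised reconstruction loss
    ||W x_t - x_t||^2 (an assumed, simplified loss)."""
    n, d = x.shape
    W = np.zeros((d, d))
    out = np.zeros_like(x)
    for t in range(n):
        xt = x[t]
        err = W @ xt - xt
        grad = 2.0 * np.outer(err, xt)  # gradient of the loss w.r.t. W
        W = W - lr * grad               # one gradient step per token
        out[t] = W @ xt                 # output uses the updated state
    return out

def hybrid_block(x, window=4, lr=0.01):
    """Residual sum of the local and global branches (assumed combination)."""
    return x + local_self_attention(x, window) + ttt_linear(x, lr)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.standard_normal((16, 8))  # 16 tokens, 8 features each
    print(hybrid_block(tokens).shape)
```

In this sketch the cost of the local branch grows with the window size rather than the full sequence length, and the TTT branch processes tokens sequentially with a fixed-size state, which is the general reason such hybrids are attractive for long sequences.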