Towards Storytelling Animations: Joint Synthesis of Human and Camera Motions
Abstract
Animation relies heavily on effective cinematography to enhance narrative clarity and emotional resonance, yet crafting optimal character interactions and camera positioning remains a resource-intensive challenge. Existing methods typically require extensive, predefined datasets, limiting their effectiveness on unfamiliar character interactions or novel animation contexts. We introduce an approach that jointly generates character interactions and camera placements with an unconditional diffusion-based generative model. Our method uses a unified framework to simultaneously synthesize realistic two-person motions and corresponding cinematographic compositions without relying on predefined visual datasets. By integrating 3D motion representations with Toric camera features, our diffusion model captures spatial orientation and relative positioning, enabling coherent and expressive scene generation. Experiments demonstrate that our approach autonomously produces diverse, plausible two-character interactions coupled with compelling camera movements, enhancing creative flexibility in animated storytelling.
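
To make the joint-synthesis idea concrete, below is a minimal PyTorch sketch of how a single unconditional diffusion model could denoise a concatenated state of two-person poses and Toric camera parameters. The MLP denoiser, the dimensions (22 joints per character, a 5-dimensional Toric parameterization), and the plain DDPM sampler are illustrative assumptions, not the paper's actual architecture.

# Hypothetical sketch: joint denoising of two-person motion and Toric camera
# features with one unconditional diffusion model. All names, dimensions, and
# the simple MLP denoiser are illustrative assumptions.
import torch
import torch.nn as nn

POSE_DIM = 2 * 22 * 3   # assumed: two characters, 22 joints each, 3D positions
TORIC_DIM = 5           # assumed Toric parameters (alpha, theta, phi, screen x/y)
STATE_DIM = POSE_DIM + TORIC_DIM

def timestep_embedding(t: torch.Tensor, dim: int = 128) -> torch.Tensor:
    """Standard sinusoidal embedding of the diffusion timestep."""
    half = dim // 2
    freqs = torch.exp(-torch.arange(half, dtype=torch.float32)
                      * (torch.log(torch.tensor(10000.0)) / (half - 1)))
    args = t.float().unsqueeze(1) * freqs.unsqueeze(0)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=1)

class JointDenoiser(nn.Module):
    """Predicts the noise added to the concatenated motion+camera state."""
    def __init__(self, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + 128, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, STATE_DIM),
        )

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # Condition only on the timestep: the model is unconditional in content.
        return self.net(torch.cat([x_t, timestep_embedding(t)], dim=1))

@torch.no_grad()
def sample(model: JointDenoiser, steps: int = 1000,
           batch: int = 1) -> tuple[torch.Tensor, torch.Tensor]:
    """Plain DDPM ancestral sampling over the joint motion+camera state."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(batch, STATE_DIM)  # start from pure noise (unconditional)
    for i in reversed(range(steps)):
        t = torch.full((batch,), i, dtype=torch.long)
        eps = model(x, t)
        mean = (x - betas[i] / torch.sqrt(1 - alpha_bars[i]) * eps) / torch.sqrt(alphas[i])
        x = mean + (torch.sqrt(betas[i]) * torch.randn_like(x) if i > 0 else 0)
    # Split the denoised state back into pose and camera components.
    return x[:, :POSE_DIM], x[:, POSE_DIM:]

# Usage: poses, toric = sample(JointDenoiser(), steps=50)

Because motion and camera features share one diffusion state, every denoising step updates them jointly, which is one way the coupling between character interaction and cinematographic composition described above could be realized.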