OmniLottie: Generating Vector Animations via Parameterized Lottie Tokens
Abstract
OmniLottie is a versatile framework that generates high-quality vector animations from multi-modal instructions, including interleaved texts, images, and videos. To fully parameterize vector animations for flexible control over motion and visual content, we adopt the Lottie representation, which encodes both shapes and animated behaviors in a single JSON file. Building upon a pretrained vision–language model (VLM), OmniLottie produces vivid, semantically aligned vector animations that adhere closely to multi-modal conditions. To avoid the complexity and irregularity of raw JSON structures, we introduce a dedicated Lottie tokenizer that transforms Lottie files into structured sequences of function calls representing shapes, animation commands, and their parameters. This design enables the model to directly learn the underlying shape and animation priors from data, substantially improving generation stability and controllability. To further advance research in vector animation generation, we curate MMLottie-2M, a large-scale dataset of professionally designed vector animations paired with textual and visual annotations. Leveraging the proposed tokenizer and our newly established dataset, OmniLottie demonstrates strong multi-modal conditional generation capabilities using a simple next-token prediction objective. For qualitative results, please refer to the generated animations rendered through standard Lottie players on the supplementary website.
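The tokenizer described above can be pictured with a toy sketch. The snippet below is a hypothetical illustration, not the paper's actual tokenizer: it flattens a minimal Lottie-like JSON layer into a sequence of function-call tokens, emitting one call per shape and one per position keyframe. The token names (`ellipse`, `move_to`) and the exact call syntax are assumptions for illustration; only the JSON field names (`ty`, `ks`, `p`, `a`, `k`, `t`, `s`) follow the real Lottie schema.

```python
import json

def tokenize_lottie(lottie: dict) -> list[str]:
    """Flatten a minimal Lottie-like document into function-call tokens.

    Hypothetical sketch: real Lottie files contain many more shape types,
    transforms, and easing curves than are handled here.
    """
    tokens = []
    for layer in lottie.get("layers", []):
        # Shape items: "ty": "el" is Lottie's ellipse shape type;
        # its size lives under "s" -> "k" as [width, height].
        for shape in layer.get("shapes", []):
            if shape.get("ty") == "el":
                w, h = shape["s"]["k"]
                tokens.append(f"ellipse(w={w}, h={h})")
        # Layer transform: "ks" -> "p" is position; "a": 1 marks an
        # animated property whose "k" holds keyframes with time "t"
        # and start value "s".
        pos = layer.get("ks", {}).get("p", {})
        if pos.get("a") == 1:
            for kf in pos["k"]:
                x, y = kf["s"][:2]
                tokens.append(f"move_to(t={kf['t']}, x={x}, y={y})")
    return tokens

doc = json.loads("""
{"layers": [{
  "shapes": [{"ty": "el", "s": {"k": [100, 100]}}],
  "ks": {"p": {"a": 1, "k": [
      {"t": 0,  "s": [10, 20]},
      {"t": 30, "s": [200, 20]}
  ]}}
}]}
""")

print(tokenize_lottie(doc))
# → ['ellipse(w=100, h=100)', 'move_to(t=0, x=10, y=20)', 'move_to(t=30, x=200, y=20)']
```

Such a regular, call-like token stream is what lets a next-token-prediction model learn shape and animation priors without confronting raw JSON nesting directly.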