Yume1.5: A Text-Controlled Interactive World Generation Model
Abstract
Recent approaches have demonstrated the promise of using diffusion models to generate interactive and explorable worlds. However, most of these methods face critical challenges, including excessively large parameter counts, reliance on lengthy inference procedures, and rapidly growing historical context, which severely limit real-time performance; they also lack text-controlled generation capabilities. To address these challenges, we propose Yume1.5, a novel framework designed to generate realistic, interactive, and continuous worlds from a single image or text prompt, with support for keyboard-based exploration of the generated worlds. The framework comprises three core components: (1) a long-video generation method combining unified context compression and linear attention; (2) a context-compression-based bidirectional attention distillation approach with an enhanced text-embedding scheme for real-time streaming video generation; and (3) a text-controlled method for generating world events. Yume1.5 achieves an average generation speed of 12 fps at 540p resolution on a single A100 GPU. We provide the codebase in the supplementary material; the model weights and full codebase will be made public.