MANSION: Multi-floor lANguage-to-3D Scene generatIOn for loNg-horizon tasks
Abstract
Real-world robotic tasks are long-horizon and often span multiple floors, requiring complex spatial reasoning. Existing embodied benchmarks, however, are largely confined to single-floor homes and therefore cannot evaluate agents on realistic, building-scale tasks. We introduce MANSION, a language-driven framework for generating building-scale, multi-floor 3D environments for long-horizon tasks. Using this framework, we release MansionWorld, a large-scale dataset featuring over 1,000 diverse, non-residential buildings. These environments support cross-floor skills and long-horizon task generation on reusable building layouts. Experiments show that current methods degrade sharply on our multi-floor tasks, highlighting both the challenge and the value of this setting for advancing embodied AI.