FloVerse: Floor Plan-Guided Multi-Modal Navigation
Weiqi Huang ⋅ Shuangyi Dong ⋅ Jiaxin Li ⋅ Yifei Guo ⋅ Zan Wang ⋅ Wei Liang
Abstract
Floor plans encapsulate compact spatial priors, enabling agents to navigate unseen scenes more efficiently. While prior work has explored floor plan–guided navigation, it has focused mainly on PointNav and a limited set of environments. To bridge this gap, we introduce FloVerse, a new task for floor plan–guided embodied navigation that unifies PointNav, ObjectNav, and ImageNav. To support this task, we assemble FloVerse-1.6K, a large-scale dataset of 1.6K scenes from HM3D and Gibson 4+, paired with corresponding floor plans and comprising 240K expert trajectories and 12M RGB-D frames. We further propose ThreeDiff, a two-stage imitation learning policy consisting of a planner, a diffusion-based multimodal goal-reasoning module trained via masked-modality modeling, and a refiner, a depth-based trajectory refinement module for safe execution. Extensive experiments show that (1) floor plan priors consistently improve navigation performance across all goal modalities, and (2) ThreeDiff implicitly learns to infer goal locations from diverse goal representations through spatial reasoning. These results highlight the effectiveness of structured spatial priors and our unified approach for floor plan–guided embodied navigation.
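The masked-modality modeling mentioned above can be illustrated with a minimal sketch: during training, goal modalities (point, object, image) are randomly dropped so the policy learns to reason about the goal from whichever cues remain. The function name `mask_modalities`, the embedding layout, and the masking probability below are illustrative assumptions, not the paper's actual implementation.

```python
import random

# Hypothetical goal modalities matching the three unified tasks.
MODALITIES = ["point", "object", "image"]

def mask_modalities(goal_embeddings, keep_prob=0.5, rng=random):
    """Randomly zero out goal-modality embeddings, always keeping at
    least one, so the policy cannot rely on any single goal cue."""
    masked = dict(goal_embeddings)
    present = [m for m in MODALITIES if m in masked]
    # Choose a (non-empty) subset of modalities to keep visible.
    kept = [m for m in present if rng.random() < keep_prob]
    if not kept:
        kept = [rng.choice(present)]
    for m in present:
        if m not in kept:
            masked[m] = [0.0] * len(masked[m])  # mask by zeroing
    return masked

# Toy 2-D "embeddings" for one training sample.
rng = random.Random(0)
goals = {"point": [1.0, 2.0], "object": [0.3, 0.7], "image": [0.9, 0.1]}
out = mask_modalities(goals, rng=rng)
```

At inference time, the same interface lets a single policy accept any one goal modality, with the others zeroed, which is one plausible way a unified PointNav/ObjectNav/ImageNav policy could be trained.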