AGiLe: Learning Robust Long-Horizon Manipulation via Affordance-Grounded Bidirectional Latent Planning
Abstract
Robust execution of long-horizon manipulation tasks remains a central challenge in embodied intelligence, because it requires both coherent high-level planning and reliable low-level control. Existing approaches face two critical limitations. First, prediction errors accumulate during subgoal planning, so deviations compound over time. Second, there is a planning-execution gap: abstract high-level plans are not effectively grounded in the continuous perception-action space. To address these challenges, we propose a unified framework, Affordance-Grounded Bidirectional Latent Planning (AGiLe). AGiLe introduces a bidirectional latent planning mechanism that jointly optimizes a backward planner and a forward critic. The backward planner generates goal-directed subgoals starting from the final objective, while the forward critic assesses their reachability, which keeps the plan consistent over long horizons and provides temporal robustness. AGiLe further bridges the planning-execution gap by using affordance as structural guidance, grounding abstract subgoals in dense, pixel-level visual affordances that drive action. This grounding improves spatial robustness, enabling the system to adapt to semantic and visual distractors. Extensive evaluations in both simulation and real-world settings show that AGiLe significantly outperforms strong baselines, achieving an 8.5% improvement over prior state-of-the-art methods in simulation and demonstrating its effectiveness and robustness in long-horizon manipulation tasks.
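The interplay between a backward planner and a forward critic can be illustrated with a toy sketch. This is not AGiLe's implementation: the abstract does not specify the models, so the sketch assumes latent states are plain coordinate tuples, replaces the learned backward planner with linear interpolation from the goal toward the start, and replaces the learned forward critic with a simple distance threshold. All names (`backward_plan`, `forward_critic`, `plan`, `max_step`) are hypothetical.

```python
import math
from typing import List, Tuple

Latent = Tuple[float, ...]


def backward_plan(start: Latent, goal: Latent, num_subgoals: int) -> List[Latent]:
    """Toy stand-in for a backward planner (assumption, not the paper's model).

    Emits subgoals ordered from the goal back toward the start by
    interpolating in latent space.
    """
    subgoals = []
    for k in range(1, num_subgoals + 1):
        t = k / (num_subgoals + 1)  # fraction of the way from goal back to start
        subgoals.append(tuple(g + t * (s - g) for s, g in zip(start, goal)))
    return subgoals


def forward_critic(a: Latent, b: Latent, max_step: float) -> bool:
    """Toy stand-in for a forward critic (assumption, not the paper's model).

    Treats b as reachable from a if their latent distance is at most max_step.
    """
    return math.dist(a, b) <= max_step


def plan(start: Latent, goal: Latent, max_step: float, max_subgoals: int = 16) -> List[Latent]:
    """Propose subgoals backward from the goal, then verify them forward.

    The number of subgoals grows until the critic accepts every transition.
    """
    for n in range(max_subgoals + 1):
        backward = backward_plan(start, goal, n)
        # Reverse the goal-first subgoals so the path runs start -> goal.
        path = [start] + backward[::-1] + [goal]
        if all(forward_critic(a, b, max_step) for a, b in zip(path, path[1:])):
            return path
    raise ValueError("no plan accepted by the critic within max_subgoals")


if __name__ == "__main__":
    for state in plan((0.0, 0.0), (3.0, 4.0), max_step=2.0):
        print(state)
```

In this sketch, the critic rejects plans whose transitions are too long, and the planner responds by proposing a denser chain of subgoals. In a learned system, both components would presumably be trained networks operating on visual latents, and rejection would trigger re-proposal rather than simple densification.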