Predict Before You Explore: Predictive Planning with Specialized Memory for Embodied Question Answering
Abstract
Embodied Question Answering (EQA) requires agents to navigate 3D environments, accumulate visual evidence, and reason over partial observations to answer questions. However, current agents struggle to maintain coherent, long-horizon behavior: planning remains reactive, which leads to inconsistent actions, and monolithic memories entangle all observations, which hinders retrieval of sparse but crucial evidence. We address these issues by reframing EQA through the lens of predictive processing, in which coherent behavior emerges from a prediction–correction loop grounded in stable priors. Guided by this perspective, we propose Predict Before You Explore (Pred-EQA), an architecture that integrates predictive planning with specialized memory. A high-level planner predicts where question-relevant evidence is likely to appear and generates a compact set of actionable exploration branches that encode long-horizon intent. A low-level executor then reduces uncertainty within these branches and revises predictions when they fail. A dual-memory system complements this process by separating slowly evolving structural priors from compact, question-relevant visual evidence, enabling consistent planning and efficient evidence accumulation. Through this prediction-guided exploration, Pred-EQA produces coherent trajectories under partial observability. Experiments on OpenEQA and Express-Bench show that it achieves state-of-the-art results in both accuracy and exploration efficiency, demonstrating the benefits of prediction-driven embodied reasoning.
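To make the prediction–correction loop concrete, the following is a minimal toy sketch, not the Pred-EQA implementation. Every name in it (StructuralMemory, EvidenceMemory, plan, explore) is hypothetical, and it rests on simplifying assumptions that the abstract does not state: the structural prior is one scalar per region, correction is a small moving-average step, and "evidence" is a string match against a hand-written scene.

```python
from dataclasses import dataclass, field


@dataclass
class StructuralMemory:
    """Slowly evolving priors: how likely each region is to hold relevant evidence."""
    prior: dict[str, float]
    rate: float = 0.2  # assumed step size; a small value keeps the priors stable

    def update(self, region: str, found: bool) -> None:
        # Correction step: nudge the prior toward the observed outcome.
        target = 1.0 if found else 0.0
        self.prior[region] += self.rate * (target - self.prior[region])


@dataclass
class EvidenceMemory:
    """Compact store that holds only question-relevant observations."""
    capacity: int = 4  # assumed bound on stored evidence
    items: list[str] = field(default_factory=list)

    def add(self, observation: str) -> None:
        self.items.append(observation)
        if len(self.items) > self.capacity:
            self.items.pop(0)  # drop the oldest entry to stay compact


def plan(memory: StructuralMemory, visited: set[str], k: int = 2) -> list[str]:
    """High-level planner: predict the k unvisited regions most likely to hold evidence."""
    candidates = [r for r in memory.prior if r not in visited]
    return sorted(candidates, key=lambda r: memory.prior[r], reverse=True)[:k]


def explore(target: str, scene: dict[str, list[str]], memory: StructuralMemory,
            evidence: EvidenceMemory, max_rounds: int = 5) -> list[str]:
    """Run the prediction-correction loop and return the visited trajectory."""
    visited: set[str] = set()
    trajectory: list[str] = []
    for _ in range(max_rounds):
        branches = plan(memory, visited)  # prediction
        if not branches:
            break
        for region in branches:  # low-level executor works through each branch
            visited.add(region)
            trajectory.append(region)
            found = target in scene[region]
            memory.update(region, found)  # revise the prior when the prediction fails
            if found:
                evidence.add(f"{target} seen in {region}")
                return trajectory
    return trajectory


# Toy scene and priors, invented for illustration only.
scene = {
    "kitchen": ["kettle", "sink"],
    "bedroom": ["lamp"],
    "hallway": ["coat"],
    "bathroom": ["towel"],
}
memory = StructuralMemory(prior={"kitchen": 0.3, "bedroom": 0.6, "hallway": 0.1, "bathroom": 0.5})
evidence = EvidenceMemory()
trajectory = explore("kettle", scene, memory, evidence)
print(trajectory)      # ['bedroom', 'bathroom', 'kitchen']
print(evidence.items)  # ['kettle seen in kitchen']
```

In this toy run the first two predicted branches fail, their priors are lowered, and the second planning round reaches the kitchen. Keeping the two stores separate means the planner reads only the small prior table, while the answering step would read only the short evidence list.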