Forecasting 3D Scanpaths in Egocentric Video
Abstract
Forecasting gaze behavior is an important task for understanding user intent and building AR/VR systems that anticipate where users will look and interact next. While prior work has addressed scanpath prediction in static images, forecasting gaze in egocentric videos presents new challenges due to the dynamic nature of the scene and the camera wearer's continuous movement through the 3D environment. To address these challenges, we formulate the novel task of egocentric scanpath prediction: forecasting a sequence of future fixations in 3D Cartesian coordinates relative to the last observed camera pose, producing a 3D scanpath that is grounded in the environment. We propose a transformer architecture that leverages egocentric video frames, head pose, and past 3D gaze observations to predict future 3D fixation sequences. We evaluate our method on the Aria Digital Twin dataset. Our findings establish a baseline for 3D scanpath prediction and highlight architectural elements that are important for this task.
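To make the input-output formulation concrete, the following is a minimal PyTorch sketch of the task interface only, not the paper's actual model: every module name, feature dimension, sequence length, and prediction horizon is an illustrative assumption. It fuses per-frame visual features, head poses, and past 3D fixations into one token sequence and decodes a fixed number of future 3D fixations expressed relative to the last observed camera pose.

```python
# Hypothetical sketch of the 3D scanpath forecasting interface.
# All dimensions and module choices are illustrative assumptions.
import torch
import torch.nn as nn

class ScanpathForecaster(nn.Module):
    def __init__(self, d_model=256, n_future=4):
        super().__init__()
        # Project each modality into a shared embedding space.
        self.frame_proj = nn.Linear(512, d_model)  # per-frame visual features (assumed 512-d)
        self.pose_proj = nn.Linear(6, d_model)     # head pose: translation + rotation (assumed 6-d)
        self.gaze_proj = nn.Linear(3, d_model)     # past 3D fixations (x, y, z)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        # One learned query per future fixation, decoded against the encoded context.
        self.queries = nn.Parameter(torch.randn(n_future, d_model))
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        # Regress a 3D point relative to the last observed camera pose.
        self.head = nn.Linear(d_model, 3)

    def forward(self, frame_feats, head_pose, past_gaze):
        # frame_feats: (B, T, 512), head_pose: (B, T, 6), past_gaze: (B, T, 3)
        tokens = torch.cat([
            self.frame_proj(frame_feats),
            self.pose_proj(head_pose),
            self.gaze_proj(past_gaze),
        ], dim=1)                                  # (B, 3T, d_model)
        memory = self.encoder(tokens)
        q = self.queries.unsqueeze(0).expand(frame_feats.size(0), -1, -1)
        return self.head(self.decoder(q, memory))  # (B, n_future, 3)

model = ScanpathForecaster()
pred = model(torch.randn(2, 8, 512), torch.randn(2, 8, 6), torch.randn(2, 8, 3))
print(pred.shape)  # torch.Size([2, 4, 3]): 4 future fixations in 3D per sequence
```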