Revisiting Monocular SLAM with Spatio-Temporal Scene Modeling
Abstract
Visual SLAM is one of the most fundamental problems in computer vision, with direct applications in real-time localization for AR/VR, robotics, and 3D scene reconstruction. Although significant progress has been made in both sparse and dense approaches, real-time monocular SLAM remains challenging, particularly in the uncalibrated setting, where existing methods are often inefficient and lack modularity. In this paper, we present a new visual SLAM pipeline, implemented from scratch in C++, that explicitly leverages the spatio-temporal structure of the scene for improved localization and is designed to be modular, so that off-the-shelf components can be easily integrated. We introduce a temporal representation based on a buffer of recent keyframes that preserves short-term scene continuity. To complement this, we incorporate a spatial representation based on a 3D cell-based scene model, enabling efficient retrieval of relevant 3D points from previously reconstructed regions. Leveraging recent feed-forward geometry estimators, our hybrid design combines sparse keypoint-based localization with a dense, anchor-point-driven spatial representation. This integration allows us to achieve real-time performance (exceeding 80 FPS) and a substantial efficiency improvement over existing uncalibrated monocular SLAM pipelines, while maintaining or improving localization accuracy.