240FPS Stereo Vision from Monocular Mixed Spikes
Abstract
Stereo vision is fundamental for enabling machines to perceive and interact with the world. While monocular stereo methods offer hardware compactness, they struggle to generalize because they rely on data-driven priors. Binocular and multi-view systems improve accuracy but at the cost of higher hardware complexity and lower data efficiency. In this paper, we introduce a monocular solution for high-frame-rate stereo vision via temporal optical modulation. The modulation mixes light from two views while periodically attenuating one of them at 60 Hz. To capture the temporal variations introduced by this modulation, we employ a high-speed spike camera that records the mixed scene as temporally dense spikes. The high temporal resolution of these spikes enables the construction of a linear system for efficient binocular video decoupling. Consequently, we introduce a two-stage decoding methodology for achieving high-quality stereo vision: an efficient least-squares-based baseline reconstruction followed by a deep-learning refinement module. Experimental results demonstrate that our approach achieves 240FPS binocular video reconstruction with higher accuracy than monocular systems, while maintaining hardware compactness and data efficiency.
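To illustrate the idea of least-squares decoupling, the following is a minimal sketch rather than the method used in this paper. It assumes a simple additive mixing model, mixed[t] = view_a + attenuation[t] * view_b, with a known attenuation factor per time sample and intensities already estimated from the spikes. The function name `decouple_views`, the array shapes, and the attenuation values are hypothetical and chosen only for illustration.

```python
import numpy as np

def decouple_views(mixed, attenuation):
    """Per-pixel least-squares split of a mixed signal into two views.

    Assumed model (illustrative, not taken from the paper):
        mixed[t] = view_a + attenuation[t] * view_b
    mixed:       (T, H, W) intensities at T time samples
    attenuation: (T,) known attenuation applied to view B at each sample
    """
    T, H, W = mixed.shape
    # Design matrix: one column for view A (always weight 1), one for view B.
    A = np.stack([np.ones(T), attenuation], axis=1)   # shape (T, 2)
    b = mixed.reshape(T, H * W)                       # one column per pixel
    x, *_ = np.linalg.lstsq(A, b, rcond=None)         # shape (2, H*W)
    return x[0].reshape(H, W), x[1].reshape(H, W)

# Synthetic, noise-free check with hypothetical attenuation levels.
rng = np.random.default_rng(0)
view_a = rng.random((4, 5))
view_b = rng.random((4, 5))
attenuation = np.array([1.0, 0.2, 1.0, 0.2])
mixed = view_a[None] + attenuation[:, None, None] * view_b[None]
est_a, est_b = decouple_views(mixed, attenuation)
```

Under this assumed model, the system is solvable as long as the attenuation takes at least two distinct values within the window of samples, which is what makes the design matrix full rank.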