LiteSense: Lifting Lightweight ToF with RGB for High-Resolution Metric Depth Estimation
Abstract
Metric depth estimation aims to recover depth maps with absolute scale, high resolution, and cross-scene consistency from visual observations. Existing approaches rely on either large-scale models or costly sensors to preserve metric accuracy and generalization, both of which are ill-suited to resource-constrained deployment. In this paper, we propose LiteSense, a lightweight RGB-ToF fusion framework that combines compact normalized histogram (CNH) signals with RGB cues to achieve efficient and reliable metric depth estimation. Specifically, LiteSense employs a U-Net-style encoder-decoder that forms an RGB-D input by concatenating RGB with upsampled ToF depth, providing explicit metric priors. To address the resolution disparity and recover fine details, we introduce the Patch-wise CNH Spatial Injection (PCSI) module, which injects zone-wise histogram measurements via cross-attention to guide high-level feature fusion. In extensive evaluations on NYUv2 and SUN RGB-D, LiteSense consistently outperforms monocular baselines and DELTAR at substantially lower computational cost, and demonstrates promising zero-shot generalization. We further introduce THDR3K, the first indoor RGB-ToF-CNH dataset, on which LiteSense achieves real-world accuracy comparable to—and in challenging cases surpassing—Intel RealSense. The source code and the collected dataset will be released.