Image-to-Point Cloud Feature Back-projection for Multimodal Training of 3D Semantic Segmentation
Abstract
The effective integration and use of multimodal data acquired from cameras and LiDAR is of paramount importance for perception systems. This paper proposes Image-to-Point Cloud Feature Back-Projection (IPFP), a novel method for training multimodal fusion networks that back-projects aggregated image-feature centers, computed from image pixels not aligned with the point-cloud projection, into the point-cloud feature set via an estimated depth map. As a result, image features and point cloud features reside in the same three-dimensional space, allowing image information to naturally enrich the point cloud during the network's forward pass. The back-projection can be enabled selectively -- for instance, at training time -- and disabled when multimodal data is unavailable -- for example, at test time when only LiDAR sensors are present. Experimental results demonstrate that IPFP consistently improves state-of-the-art 3D semantic segmentation models while retaining the ability to process LiDAR-only data at inference time.
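To make the core idea concrete, the following is a minimal, framework-agnostic sketch of lifting per-pixel image features into the LiDAR coordinate frame via an estimated depth map, so that they can be merged with point-cloud features. The function name, tensor layouts, pinhole-camera intrinsics K, and camera-to-LiDAR extrinsic are illustrative assumptions, not the paper's actual implementation.

    import numpy as np

    def back_project_image_features(feat, depth, K, cam_to_lidar):
        """Lift per-pixel image features into the LiDAR frame (illustrative sketch).

        feat         : (C, H, W) image feature map
        depth        : (H, W) estimated depth in meters
        K            : (3, 3) pinhole camera intrinsics
        cam_to_lidar : (4, 4) extrinsic transform from camera to LiDAR frame
        returns      : points (N, 3) in LiDAR coordinates, features (N, C)
        """
        C, H, W = feat.shape

        # Pixel grid in homogeneous image coordinates [u, v, 1]^T.
        u, v = np.meshgrid(np.arange(W), np.arange(H))
        pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1).astype(np.float64)

        # Back-project each pixel: X_cam = depth * K^{-1} [u, v, 1]^T.
        d = depth.reshape(1, -1)
        valid = d[0] > 0  # keep only pixels with a valid depth estimate
        cam_pts = (np.linalg.inv(K) @ pix) * d

        # Move the lifted points from the camera frame into the LiDAR frame.
        cam_pts_h = np.vstack([cam_pts, np.ones((1, cam_pts.shape[1]))])
        lidar_pts = (cam_to_lidar @ cam_pts_h)[:3].T

        feats = feat.reshape(C, -1).T
        return lidar_pts[valid], feats[valid]

Once expressed as 3D points, the image features can be aggregated into or concatenated with the LiDAR point features during the forward pass, and simply omitted when only LiDAR data is available.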