
MonoDiff: Monocular 3D Object Detection and Pose Estimation with Diffusion Models

Yasiru Ranasinghe · Deepti Hegde · Vishal M. Patel

Arch 4A-E Poster #98
Thu 20 Jun 10:30 a.m. PDT — noon PDT

Abstract: 3D object detection and pose estimation from a single-view image is challenging due to the high uncertainty caused by the absence of 3D perception. As a solution, recent monocular 3D detection methods leverage additional modalities, such as stereo image pairs and LiDAR point clouds, to enhance image features at the expense of additional annotation costs. We propose using diffusion models to learn effective representations for monocular 3D detection without additional modalities or training data. We present $MonoDiff$, a novel framework that employs the reverse diffusion process to estimate the 3D bounding box and orientation. However, given the variability of bounding box sizes along different dimensions, sampling noise from a standard Gaussian distribution is ineffective. We therefore adopt a Gaussian mixture model to sample noise during the forward diffusion process and to initialize the reverse diffusion process. Furthermore, since the diffusion model generates the 3D parameters for a given object image, we leverage 2D detection information to provide additional supervision by maintaining the 3D-to-2D projection correspondence. Finally, we incorporate a dynamic weighting scheme, based on the signal-to-noise ratio, to account for the level of uncertainty in the projection-based supervision at different timesteps. $MonoDiff$ outperforms current state-of-the-art monocular 3D detection methods on the KITTI and Waymo benchmarks without additional depth priors.
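The two ingredients the abstract highlights, mixture-model noise in the forward diffusion step and an SNR-dependent loss weight, could be sketched as below. This is a minimal illustrative sketch, not the paper's implementation: the function names, the mixture parameters, and the noise schedule are all assumptions, and the real method parameterizes the mixture from bounding-box statistics rather than fixed constants.

```python
import numpy as np

def gmm_noise(shape, means, stds, weights, rng):
    """Sample noise from a 1-D Gaussian mixture instead of a standard N(0, 1).

    Each element independently picks a mixture component (illustrative
    parameters; the paper fits the mixture to bounding-box statistics).
    """
    comp = rng.choice(len(weights), size=shape, p=weights)
    return rng.normal(np.asarray(means)[comp], np.asarray(stds)[comp])

def forward_diffuse(x0, t, alpha_bar, means, stds, weights, rng):
    """One DDPM-style forward step, x_t = sqrt(a_bar)*x0 + sqrt(1-a_bar)*eps,
    with eps drawn from the Gaussian mixture rather than N(0, 1)."""
    eps = gmm_noise(x0.shape, means, stds, weights, rng)
    a = alpha_bar[t]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps, eps

def snr_weight(t, alpha_bar):
    """Signal-to-noise ratio at timestep t: high SNR (early timesteps) means the
    noisy parameters are reliable, so projection supervision is weighted more."""
    a = alpha_bar[t]
    return a / (1.0 - a)
```

A training loop would diffuse the ground-truth 3D box parameters with `forward_diffuse`, predict them back with the denoiser, and scale the 2D-projection consistency loss at timestep `t` by `snr_weight(t, alpha_bar)` so that heavily noised samples contribute less supervision.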
