

Improving Bird's Eye View Semantic Segmentation by Task Decomposition

Tianhao Zhao · Yongcan Chen · Yu Wu · Tianyang Liu · Bo Du · Peilun Xiao · Shi Qiu · Hongda Yang · Guozhen Li · Yi Yang · Yutian Lin

Arch 4A-E Poster #87
Thu 20 Jun 5 p.m. PDT — 6:30 p.m. PDT


Semantic segmentation in bird’s eye view (BEV) plays a crucial role in autonomous driving. Previous methods usually follow an end-to-end pipeline, directly predicting the BEV segmentation map from monocular RGB inputs. However, because the RGB inputs and the BEV targets lie in distinct perspectives, this direct point-to-point translation is hard to optimize. In this paper, we decompose the original BEV segmentation task into two stages, namely BEV map reconstruction and cross-view feature alignment. In the first stage, we train a BEV autoencoder to perfectly reconstruct the BEV segmentation label maps from a corrupted, noisy latent representation, which urges the decoder to learn fundamental knowledge of typical BEV scene patterns. The second stage maps the RGB input images into the pretrained BEV latent space of Stage I, directly optimizing the correlations between the two views at the feature level. Our approach avoids the complexity of combining perception and generation in a single step, equipping the model to handle intricate and challenging scenes effectively. In addition, we propose to transform the BEV segmentation map from the Cartesian coordinate system to the polar coordinate system. This conversion establishes a one-to-one correspondence between each RGB image column and its BEV segmentation targets. Extensive experiments on the nuScenes and Argoverse datasets show that our model clearly outperforms state-of-the-art models.
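The Cartesian-to-polar conversion described above can be sketched as a simple resampling of the BEV label map onto an (angle × radius) grid, so that each angular ray lines up with one image column direction. This is a minimal illustrative sketch only; the function name, the ego vehicle's placement at the map centre, and the nearest-neighbour sampling are our assumptions, not the paper's implementation.

```python
import numpy as np

def bev_cartesian_to_polar(bev, num_rays=360, num_bins=100):
    """Resample a square Cartesian BEV label map onto a polar grid.

    bev: (H, W) integer label map; the ego vehicle is assumed to sit
         at the map centre (an assumption for this sketch).
    Returns an array of shape (num_rays, num_bins): one row per
    angular ray, one column per radial distance bin.
    """
    H, W = bev.shape
    cy, cx = (H - 1) / 2.0, (W - 1) / 2.0          # assumed ego position
    max_r = min(cy, cx)                             # stay inside the map
    thetas = np.linspace(0.0, 2.0 * np.pi, num_rays, endpoint=False)
    radii = np.linspace(0.0, max_r, num_bins)
    # Broadcast every (theta, r) pair; shapes are (num_rays, num_bins).
    rr, tt = np.meshgrid(radii, thetas)
    # Nearest-neighbour lookup back into the Cartesian map.
    ys = np.clip(np.round(cy + rr * np.sin(tt)).astype(int), 0, H - 1)
    xs = np.clip(np.round(cx + rr * np.cos(tt)).astype(int), 0, W - 1)
    return bev[ys, xs]
```

With such a layout, supervising the network column-by-column becomes a per-row comparison in the polar map, since each ray corresponds to a fixed viewing direction from the ego vehicle.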
