

Poster

Spatial-Temporal Visual Representation for Self-Supervised Motion Planning

Yichen Xie · Runsheng Xu · Tong He · Jyh-Jing Hwang · Katie Z Luo · Jingwei Ji · Hubert Lin · Letian Chen · Yiren Lu · Zhaoqi Leng · Dragomir Anguelov · Mingxing Tan


Abstract:

The latest advancements in multi-modal large language models (MLLMs) have spurred renewed interest in end-to-end motion planning approaches for autonomous driving. Many end-to-end approaches rely on human annotations to learn intermediate perception and prediction tasks, while purely self-supervised approaches—which directly learn from sensor inputs to generate planning trajectories without human annotations—often underperform the state of the art. We observe a key gap in the input representation space: end-to-end approaches built on MLLMs are often pretrained with reasoning tasks in perspective-view space rather than the native 3D space in which autonomous vehicles plan. To this end, we propose PaLI-Driver, based on the popular PaLI vision-language model. PaLI-Driver uses a novel sparse volume strategy to seamlessly transform the strong visual representation of MLLMs from perspective view to 3D space without the need to finetune the vision encoder. This representation aggregates multi-view and multi-frame visual inputs and enables better prediction of planning trajectories in 3D space. To validate our method, we run experiments on both nuScenes and our in-house collected dataset X-Planning. Results show that PaLI-Driver performs favorably against existing supervised multi-task approaches while requiring no human annotations. It also demonstrates strong scalability when pretrained on large volumes of unannotated driving logs.
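
The abstract does not detail the sparse volume strategy, so the following is only a minimal conceptual sketch of the general idea it describes: projecting sparse 3D voxel centers into frozen perspective-view feature maps and averaging the sampled features across views and frames. All function and variable names here are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def lift_to_sparse_volume(feats, intrinsics, cam_to_ego, voxel_centers):
    """
    Conceptual sketch (not the paper's method): aggregate frozen
    perspective-view features into sparse 3D voxel features by
    projecting voxel centers into each camera and bilinearly sampling.

    feats:         (V, C, H, W)  per-camera feature maps (V = views x frames)
    intrinsics:    (V, 3, 3)     pinhole camera intrinsics
    cam_to_ego:    (V, 4, 4)     camera-to-ego rigid transforms
    voxel_centers: (N, 3)        sparse voxel centers in the ego frame (meters)
    returns:       (N, C)        per-voxel features averaged over valid views
    """
    V, C, H, W = feats.shape
    N = voxel_centers.shape[0]

    # Homogeneous voxel coordinates in the ego frame: (N, 4)
    pts = torch.cat([voxel_centers, torch.ones(N, 1)], dim=-1)

    # Transform voxel centers into each camera frame: (V, N, 3)
    ego_to_cam = torch.linalg.inv(cam_to_ego)
    pts_cam = torch.einsum('vij,nj->vni', ego_to_cam, pts)[..., :3]

    # Perspective projection to pixel coordinates: (V, N, 2)
    depth = pts_cam[..., 2:3].clamp(min=1e-5)
    pix = torch.einsum('vij,vnj->vni', intrinsics, pts_cam / depth)[..., :2]

    # Normalize to [-1, 1] for grid_sample and sample per-view features
    grid = torch.stack([pix[..., 0] / (W - 1),
                        pix[..., 1] / (H - 1)], dim=-1) * 2 - 1
    sampled = F.grid_sample(feats, grid.unsqueeze(1), align_corners=True)  # (V, C, 1, N)
    sampled = sampled.squeeze(2).permute(0, 2, 1)                          # (V, N, C)

    # Mask out voxels behind the camera or projecting outside the image
    valid = (pts_cam[..., 2] > 0) & (grid.abs() <= 1).all(dim=-1)  # (V, N)
    weights = valid.float().unsqueeze(-1)

    # Average sampled features over valid views/frames
    return (sampled * weights).sum(0) / weights.sum(0).clamp(min=1)
```

In this sketch the vision-encoder features `feats` stay frozen and only the 3D aggregation operates on them, mirroring the abstract's claim that no vision-encoder finetuning is required; how voxels are selected, fused over time, and fed to the planner is left unspecified here.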
