

Vision-and-Language Navigation via Causal Learning

Liuyi Wang · Zongtao He · Ronghao Dang · Mengjiao Shen · Chengju Liu · Qijun Chen

Arch 4A-E Poster #338
Thu 20 Jun 10:30 a.m. PDT — noon PDT


In the pursuit of robust and generalizable environment perception and language understanding, the ubiquitous challenge of dataset bias continues to plague vision-and-language navigation (VLN) agents, hindering their performance in unseen environments. This paper introduces the generalized cross-modal causal transformer (GOAT), a pioneering solution rooted in the paradigm of causal inference. By delving into both observable and unobservable confounders within vision, language, and history, we propose the back-door and front-door adjustment causal learning (BACL and FACL) modules to promote unbiased learning by comprehensively mitigating potential spurious correlations. Additionally, to capture global confounder features, we propose a cross-modal feature pooling (CFP) module supervised by contrastive learning, which is also shown to be effective in improving cross-modal feature learning during pre-training. Extensive experiments across multiple VLN datasets (R2R, REVERIE, RxR, and SOON) underscore the superiority of our proposed method over previous state-of-the-art approaches. Code is available at
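For context, the back-door and front-door adjustments that BACL and FACL are named after are the standard identities from Pearl's do-calculus; the abstract itself does not give the paper's exact formulation, so the following is the textbook form. For a treatment $X$, outcome $Y$, observed confounder $Z$, and mediator $M$:

```latex
% Back-door adjustment: block the spurious path X <- Z -> Y
% by stratifying over the observed confounder Z.
P(Y \mid do(X)) = \sum_{z} P(Y \mid X, Z = z)\, P(Z = z)

% Front-door adjustment: when the confounder is unobservable,
% route the causal effect through a mediator M on X -> M -> Y.
P(Y \mid do(X)) = \sum_{m} P(M = m \mid X) \sum_{x'} P(Y \mid M = m, X = x')\, P(X = x')
```

Intuitively, the back-door form averages predictions over confounder strata instead of letting the model exploit their correlation with the outcome, while the front-door form handles confounders that cannot be observed by measuring the effect through an intermediate variable.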
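In deep models, the back-door expectation over confounders is commonly approximated by attention over a fixed dictionary of confounder prototypes (e.g., cluster centers of scene or object features). The sketch below illustrates that general pattern only; the class and parameter names (`BackdoorAdjust`, `num_confounders`) are hypothetical and not taken from the GOAT codebase.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BackdoorAdjust(nn.Module):
    """Approximate back-door adjustment: the expectation E_z[f(x, z)] is
    realized as attention of the input feature over a dictionary of
    confounder prototypes. Illustrative sketch, not the GOAT implementation."""

    def __init__(self, dim: int, num_confounders: int):
        super().__init__()
        # Dictionary of confounder prototypes z_1..z_K (in practice often
        # initialized from k-means cluster centers of training features).
        self.confounders = nn.Parameter(torch.randn(num_confounders, dim))
        self.query = nn.Linear(dim, dim)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim) potentially biased input feature.
        q = self.query(x)                                        # (B, D)
        scores = q @ self.confounders.t() / q.size(-1) ** 0.5    # (B, K)
        attn = F.softmax(scores, dim=-1)
        z = attn @ self.confounders                              # soft E_z, (B, D)
        # Combine the original feature with the confounder expectation.
        return self.fuse(torch.cat([x, z], dim=-1))

# Usage sketch: debias a 768-d visual feature with 32 prototypes.
module = BackdoorAdjust(dim=768, num_confounders=32)
debiased = module(torch.randn(4, 768))
print(debiased.shape)  # torch.Size([4, 768])
```

The attention weights play the role of $P(Z = z)$ conditioned on the input, so the module outputs a feature that has been softly averaged over confounder strata rather than tied to any single spurious context.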
