

Non-autoregressive Sequence-to-Sequence Vision-Language Models

Kunyu Shi · Qi Dong · Luis Goncalves · Zhuowen Tu · Stefano Soatto

Arch 4A-E Poster #381
Thu 20 Jun 10:30 a.m. PDT — noon PDT


Sequence-to-sequence vision-language models are showing promise, but their applicability is limited by inference latency, which stems from their autoregressive way of generating predictions. We propose a sequence-to-sequence vision-language model with a flexible hypothesis space, manifest in the training set and encoded in a layer of learnable query tokens. The architecture is trained with a novel loss, inspired by the language domain, that marginalizes over multiple inference paths in the decoder. This gives us the flexibility to adapt the hypothesis space to the task, rather than restricting it to the embedding of a single token as in an autoregressive model. The resulting model, NARVL, achieves performance on par with its autoregressive counterpart, but is faster at inference time because the decoder is executed only once to jointly produce all output tokens, rather than sequentially to produce them one at a time. We test our model on four vision-language tasks and perform ablation studies to single out the contribution of each component.
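
The latency argument in the abstract can be illustrated with a toy sketch. This is not the NARVL implementation: the `CountingDecoder` class, its `step` and `parallel` methods, the `EOS` token id, and the fixed "target" sequence are all hypothetical stand-ins, chosen only to count how many decoder passes each decoding style needs for the same output.

```python
from typing import List

EOS = 0  # hypothetical end-of-sequence token id (an assumption for this sketch)


class CountingDecoder:
    """Toy stand-in for a decoder that counts its forward passes.

    It simply replays a fixed target sequence; a real model would
    compute predictions from image and text features instead.
    """

    def __init__(self, target: List[int]) -> None:
        self.target = list(target)
        self.calls = 0

    def step(self, prefix: List[int]) -> int:
        # Autoregressive style: one pass yields the next token given the prefix.
        self.calls += 1
        return self.target[len(prefix)]

    def parallel(self, num_queries: int) -> List[int]:
        # Non-autoregressive style: one pass yields a prediction for every
        # query slot; unused slots are filled with EOS here.
        self.calls += 1
        padded = self.target + [EOS] * (num_queries - len(self.target))
        return padded[:num_queries]


def decode_autoregressive(dec: CountingDecoder, max_len: int) -> List[int]:
    out: List[int] = []
    while len(out) < max_len:
        tok = dec.step(out)  # each new token costs one decoder pass
        out.append(tok)
        if tok == EOS:
            break
    return out


def decode_parallel(dec: CountingDecoder, num_queries: int) -> List[int]:
    out = dec.parallel(num_queries)  # a single decoder pass for all slots
    if EOS in out:
        out = out[: out.index(EOS) + 1]  # drop slots after the first EOS
    return out


if __name__ == "__main__":
    ar = CountingDecoder([5, 7, 9, EOS])
    print(decode_autoregressive(ar, max_len=10), "passes:", ar.calls)

    nar = CountingDecoder([5, 7, 9, EOS])
    print(decode_parallel(nar, num_queries=8), "passes:", nar.calls)
```

In this sketch the autoregressive loop needs one pass per output token, while the parallel variant needs a single pass regardless of output length. The sketch says nothing about accuracy or about the training loss described in the abstract.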
