

Sequential Modeling Enables Scalable Learning for Large Vision Models

Yutong Bai · Xinyang Geng · Karttikeya Mangalam · Amir Bar · Alan L. Yuille · Trevor Darrell · Jitendra Malik · Alexei A. Efros

Arch 4A-E Poster #323
Fri 21 Jun 10:30 a.m. PDT — noon PDT


We introduce a novel sequential modeling approach that enables learning a Large Vision Model (LVM) without making use of any linguistic data. Such pure vision models can possess capabilities for broad visual reasoning, analogous to those found in Large Language Models (LLMs). To do this, we define a common format, "visual sentences", in which we can represent raw images and videos as well as annotated data sources such as semantic segmentations and depth reconstructions without needing any meta-knowledge beyond the pixels. Once this wide variety of visual data (420 billion tokens) is represented as sequences, the model can be trained to minimize cross-entropy loss for next-token prediction. By training across various scales of model architecture and data diversity, we provide empirical evidence that our models scale effectively. Many different vision tasks can be solved by designing suitable prompts at test time, showcasing remarkable generalization capabilities.
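
The training objective named in the abstract, cross-entropy loss for next-token prediction, can be illustrated with a minimal sketch. It assumes a visual sentence has already been converted to a list of integer token ids by some tokenizer (the abstract does not specify one) and that a model has produced one vector of unnormalized scores per predicted position. The function name, the toy vocabulary size of 4, and the example values are hypothetical and are not taken from the authors' implementation.

```python
import math


def next_token_cross_entropy(logits, tokens):
    """Mean cross-entropy of predicting tokens[t + 1] from logits[t].

    logits: list of length len(tokens) - 1; logits[t] holds unnormalized
            scores over the vocabulary for the token at position t + 1.
    tokens: list of integer token ids (one flattened visual sentence).
    """
    if len(logits) != len(tokens) - 1:
        raise ValueError("need exactly one logit vector per predicted token")
    total = 0.0
    for scores, target in zip(logits, tokens[1:]):
        # Numerically stable log-sum-exp: subtract the max before exponentiating.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        # Negative log-probability assigned to the true next token.
        total += log_z - scores[target]
    return total / len(logits)


# Hypothetical toy data: a vocabulary of 4 codes and a 3-token sentence.
tokens = [2, 0, 3]

# A model with no preference spreads probability evenly over the vocabulary.
uniform_logits = [[0.0] * 4, [0.0] * 4]
uniform_loss = next_token_cross_entropy(uniform_logits, tokens)

# A model that scores the true next token higher gets a lower loss.
informed_logits = [[5.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 5.0]]
informed_loss = next_token_cross_entropy(informed_logits, tokens)
```

With uniform scores the loss equals the logarithm of the vocabulary size, which is the baseline that any trained model should beat. The first token has no preceding context, so it is used only as input and never as a prediction target.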
