

Object Recognition as Next Token Prediction

Kaiyu Yue · Bor-Chun Chen · Jonas Geiping · Hengduo Li · Tom Goldstein · Ser-Nam Lim

Arch 4A-E Poster #199
Award: Highlight
[ Project Page ]
Thu 20 Jun 5 p.m. PDT — 6:30 p.m. PDT

Abstract: We present an approach to pose object recognition as next token prediction. The idea is to apply a language decoder that auto-regressively predicts the text tokens from image embeddings to form labels. To ground this prediction process in auto-regression, we customize a non-causal attention mask for the decoder, incorporating two key features: modeling tokens from different labels to be independent, and treating image tokens as a prefix. This masking mechanism inspires an efficient method, one-shot sampling, to simultaneously sample tokens of multiple labels in parallel and rank generated labels by their probabilities during inference. To further enhance the efficiency, we propose a simple strategy to construct a compact decoder by simply discarding the intermediate blocks of a pretrained language model. This approach yields a decoder that matches the full model's performance while being notably more efficient. The code is available at [](
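The two mask properties named in the abstract (image tokens as a prefix, tokens of different labels independent) can be illustrated with a small sketch. This is not the authors' implementation: the function name `build_mask`, its parameters, and the choice to let image tokens attend to each other bidirectionally are assumptions made for illustration only.

```python
def build_mask(num_image_tokens, label_lengths):
    """Return mask[q][k] == True where query position q may attend to key k.

    Hypothetical helper. Sequence layout (an assumption):
    [image tokens][label 0 tokens][label 1 tokens]...
    """
    total = num_image_tokens + sum(label_lengths)
    mask = [[False] * total for _ in range(total)]

    # Assumption: the image prefix is non-causal, so image tokens
    # attend to every other image token.
    for q in range(num_image_tokens):
        for k in range(num_image_tokens):
            mask[q][k] = True

    start = num_image_tokens
    for length in label_lengths:
        for q in range(start, start + length):
            # Every label token sees the whole image prefix.
            for k in range(num_image_tokens):
                mask[q][k] = True
            # Causal attention within the token's own label only;
            # tokens of other labels stay masked out (independence).
            for k in range(start, q + 1):
                mask[q][k] = True
        start += length
    return mask


def format_mask(mask):
    """Render the mask as rows of 1/0 for quick inspection."""
    return "\n".join(" ".join("1" if v else "0" for v in row) for row in mask)


if __name__ == "__main__":
    # Two image tokens, then a two-token label and a one-token label.
    print(format_mask(build_mask(2, [2, 1])))
```

With two image tokens and labels of length 2 and 1, the last row reads `1 1 0 0 1`: the second label's token sees the image prefix and itself, but not the first label's tokens. Because labels do not attend to one another under such a mask, their tokens could in principle be sampled in parallel, which is the property the abstract connects to one-shot sampling.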
