

Poster

Your ViT is Secretly an Image Segmentation Model

Tommie Kerssies · Niccolò Cavagnero · Alexander Hermans · Narges Norouzi · Giuseppe Averta · Bastian Leibe · Gijs Dubbelman · Daan de Geus


Abstract: Vision Transformers (ViTs) have shown remarkable performance and scalability across various computer vision tasks. Currently, to apply single-scale ViTs to image segmentation, existing methods adopt a convolutional adapter to generate multi-scale features, a pixel decoder to fuse these features, and a Transformer decoder that leverages them to make predictions. In this paper, we show that the inductive biases introduced by these task-specific components can instead be learned by the ViT itself, given sufficiently large models and extensive pre-training. Leveraging these findings, we introduce the Encoder-only Mask Transformer (EoMT), which repurposes the plain ViT architecture to perform image segmentation. Using large models and strong pre-training, EoMT achieves segmentation performance similar to state-of-the-art models that use task-specific components. At the same time, EoMT is significantly faster than these methods due to its architectural simplicity, e.g., up to 4× faster using ViT-L. Across a range of model sizes, EoMT demonstrates an optimal balance between segmentation performance and inference speed, suggesting that compute resources are better allocated to scaling the ViT itself rather than adding architectural complexity. Code will be released upon acceptance.
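To make the encoder-only idea concrete, the sketch below illustrates one way such a design could look: learnable query tokens are appended to the ViT's patch tokens and processed by the ViT's own blocks, and masks are predicted by a dot product between query and patch embeddings, with no adapter, pixel decoder, or separate Transformer decoder. This is a minimal illustrative sketch based only on the abstract; module names, dimensions, the block at which queries are inserted, and the prediction heads are assumptions, not the paper's exact configuration.

```python
# Minimal sketch of an encoder-only mask transformer, assuming a standard pre-norm
# ViT and a Mask2Former-style (class, mask) output format. All hyperparameters and
# the `queries_at` insertion point are illustrative assumptions.
import torch
import torch.nn as nn


class ViTBlock(nn.Module):
    """Standard pre-norm Transformer encoder block (stand-in for a plain ViT block)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x


class EncoderOnlySegmenter(nn.Module):
    """Plain ViT encoder with appended query tokens; no adapter or pixel decoder."""

    def __init__(self, dim=768, depth=12, num_queries=100, num_classes=150,
                 patch=16, img_size=224, queries_at=8):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        num_patches = (img_size // patch) ** 2
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        self.blocks = nn.ModuleList(ViTBlock(dim) for _ in range(depth))
        self.queries = nn.Parameter(torch.zeros(1, num_queries, dim))
        self.queries_at = queries_at                  # block index where queries join (assumption)
        self.class_head = nn.Linear(dim, num_classes + 1)  # +1 for a "no object" class
        self.mask_norm = nn.LayerNorm(dim)

    def forward(self, images: torch.Tensor):
        b = images.shape[0]
        # Patchify and add positional embeddings: (B, N, dim)
        x = self.patch_embed(images).flatten(2).transpose(1, 2) + self.pos_embed
        for i, blk in enumerate(self.blocks):
            if i == self.queries_at:
                # Append learnable query tokens; later blocks attend over queries + patches jointly.
                x = torch.cat([self.queries.expand(b, -1, -1), x], dim=1)
            x = blk(x)
        q, patches = x[:, : self.queries.shape[1]], x[:, self.queries.shape[1]:]
        class_logits = self.class_logits = self.class_head(q)                     # (B, Q, C+1)
        mask_logits = torch.einsum("bqd,bpd->bqp", q, self.mask_norm(patches))    # (B, Q, N)
        return class_logits, mask_logits
```

In such a setup, per-pixel segmentation would be obtained by reshaping the per-query mask logits back to the patch grid and upsampling to the input resolution, as in standard mask-classification segmenters; the key point of the abstract is that all segmentation-specific processing happens inside the ViT blocks themselves.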
