In Pursuit of Pixel Supervision for Visual Pre-training
Abstract
Pixels provide a lightweight, scalable way to encode the physical world, preserving rich visual information with minimal human inductive bias. We demonstrate that visual pre-training with pixel supervision alone can learn desirable visual properties and produce strong representations, while remaining simple, stable, and efficient. We present Pixo, a capable self-supervised model trained purely by predicting pixels. It builds on the masked autoencoding (MAE) framework, enhancing it with a deeper decoder, larger-block masking, and additional class tokens. It is trained on 2B web-crawled images using a self-curation strategy. Pixo performs well on many downstream tasks, covering monocular depth estimation (e.g., Depth Anything), feed-forward 3D reconstruction (e.g., MapAnything), object segmentation (e.g., SAM 2), and embodied AI. We will release the training code and pre-trained models.
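To make the "larger-block masking" idea concrete, the sketch below generates a block-wise patch mask for an MAE-style setup. This is an illustrative assumption, not the paper's implementation: the function name, grid size, block size, and mask ratio are all hypothetical, chosen only to show masking contiguous blocks of patches rather than independent patches.

```python
import numpy as np

def block_mask(grid=14, block=4, mask_ratio=0.75, rng=None):
    """Mask random square blocks of patches until the target ratio is reached.

    Illustrative sketch only; Pixo's actual masking strategy and
    hyperparameters are not specified in this abstract.
    """
    rng = rng or np.random.default_rng(0)
    mask = np.zeros((grid, grid), dtype=bool)
    target = int(mask_ratio * grid * grid)
    # Keep dropping blocks (overlaps allowed) until enough patches are hidden.
    while mask.sum() < target:
        r = rng.integers(0, grid - block + 1)
        c = rng.integers(0, grid - block + 1)
        mask[r:r + block, c:c + block] = True
    return mask

m = block_mask()  # boolean (14, 14) mask; True marks masked patches
```

Compared with masking patches independently, block masking removes contiguous regions, which makes reconstruction harder and is commonly used to discourage trivial interpolation from neighboring visible patches.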