

Poster

Scaling Vision Pre-Training to 4K Resolution

Baifeng Shi · Boyi Li · Han Cai · Yao Lu · Sifei Liu · Marco Pavone · Jan Kautz · Song Han · Trevor Darrell · Pavlo Molchanov · Danny Yin


Abstract:

High-resolution perception of visual details is crucial for daily tasks. Current vision pre-training, however, is still limited to low resolutions (e.g., 384x384) because the cost of processing larger images grows quadratically. We introduce PS3 (Pre-training with Scale-Selective Scaling), which scales CLIP-style vision pre-training to 4K resolution at near-constant cost. Instead of processing entire global images, PS3 is pre-trained to selectively process local regions and contrast them with detailed local captions, allowing it to learn detailed representations at high resolution with greatly reduced computational overhead. The pre-trained PS3 can both encode the global image at low resolution and select local high-resolution regions to process based on their saliency or their relevance to a text prompt. When applied to multi-modal LLMs (MLLMs), PS3 delivers performance that scales with the pre-training resolution and significantly improves over baselines without high-resolution pre-training. We also find that current benchmarks do not require recognizing details at 4K resolution, which motivates us to propose 4KPro, a new benchmark that evaluates visual perception at 4K resolution. On 4KPro, PS3 outperforms state-of-the-art MLLMs, including a 13% improvement over GPT-4o.
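
To make the cost argument concrete, here is a rough sketch of how token count and self-attention cost grow with resolution, and of how selecting only the top-scoring regions bounds the work. It is not the PS3 implementation. The 16-pixel patch size, the 3840-pixel side length standing in for "4K", and the function names `num_tokens`, `attention_cost_ratio`, and `select_top_k_regions` are all assumptions made for illustration.

```python
def num_tokens(side_px: int, patch_px: int = 16) -> int:
    """Patch tokens for a square image (assumes side_px is divisible by patch_px)."""
    per_side = side_px // patch_px
    return per_side * per_side


def attention_cost_ratio(side_a: int, side_b: int, patch_px: int = 16) -> float:
    """Ratio of pairwise self-attention interactions, which grow as tokens ** 2."""
    return (num_tokens(side_b, patch_px) / num_tokens(side_a, patch_px)) ** 2


def select_top_k_regions(scores, k):
    """Return (row, col) indices of the k highest-scoring cells in a 2D score grid.

    The scores stand in for a saliency or text-relevance map; how PS3 actually
    computes them is not specified here.
    """
    flat = [(s, r, c) for r, row in enumerate(scores) for c, s in enumerate(row)]
    flat.sort(key=lambda t: t[0], reverse=True)
    return [(r, c) for _, r, c in flat[:k]]


if __name__ == "__main__":
    print(num_tokens(384))                   # tokens at 384x384
    print(num_tokens(3840))                  # tokens at 3840x3840
    print(attention_cost_ratio(384, 3840))   # growth in attention interactions
    print(select_top_k_regions([[0.1, 0.9], [0.5, 0.3]], k=2))
```

Under these assumptions, a 10x increase in side length gives 100x more tokens and 10,000x more pairwise attention interactions, whereas processing a fixed number of selected regions keeps the token budget roughly independent of the full image resolution.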
