Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing
Abstract
Multi-modal large language models (MLLMs) have advanced general-purpose video understanding but struggle with long, high-resolution videos---they process every pixel equally in their vision transformers (ViTs) or LLMs despite significant spatiotemporal redundancy. We introduce AutoGaze, a lightweight module that removes redundant patches before they are processed by a ViT or an MLLM. Trained with next-token prediction and reinforcement learning, AutoGaze autoregressively selects a minimal set of multi-scale patches that reconstructs the video within a user-specified error threshold, eliminating redundancy while preserving information. Empirically, AutoGaze reduces visual tokens by 4x-100x and accelerates ViTs and MLLMs by up to 19x, enabling MLLMs to scale to 1K-frame, 4K-resolution videos and achieving superior results on video benchmarks (e.g., 66.5% on VideoMME). Furthermore, we introduce HLVid, the first high-resolution, long-form video QA benchmark with multi-minute 4K videos, on which an MLLM scaled with AutoGaze outperforms the previous SOTA MLLM by 6.3%.
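To make the core idea concrete, the following is a minimal, hypothetical sketch (not the paper's actual model): it mimics the selection objective by starting from a coarse per-patch reconstruction of a frame and greedily refining the highest-variance patches to full resolution until the reconstruction error falls below a user-specified threshold. The function name `select_patches` and the threshold `tau` are illustrative assumptions; the real AutoGaze learns this policy autoregressively rather than using a fixed variance heuristic.

```python
import numpy as np

def select_patches(frame, patch=4, tau=0.01):
    """Greedy stand-in for AutoGaze's objective (illustrative only):
    begin with a coarse, per-patch-mean reconstruction, then refine the
    highest-variance patches to full resolution until the mean squared
    reconstruction error drops below the user-specified threshold tau."""
    h, w = frame.shape
    coords = [(i, j) for i in range(0, h, patch) for j in range(0, w, patch)]
    recon = frame.astype(float).copy()
    variances = []
    for i, j in coords:
        block = frame[i:i + patch, j:j + patch].astype(float)
        recon[i:i + patch, j:j + patch] = block.mean()  # coarse scale
        variances.append(block.var())
    kept = []
    # Refine patches in order of decreasing within-patch variance.
    for idx in np.argsort(variances)[::-1]:
        if np.mean((frame - recon) ** 2) <= tau:
            break  # reconstruction already within the error budget
        i, j = coords[idx]
        recon[i:i + patch, j:j + patch] = frame[i:i + patch, j:j + patch]
        kept.append((i, j))
    return kept, recon
```

On a frame that is mostly uniform, only the few high-variance patches are kept at fine scale, which is the source of the token reduction the abstract reports: uniform regions are represented coarsely, detailed regions at full resolution.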