Hi-Lo Prune: Look at What You'll Lose before Pruning with Hierarchical Token Selection
Abstract
Multimodal Large Language Models (MLLMs) have achieved remarkable progress in vision–language understanding, yet processing long visual token sequences remains computationally expensive. Existing approaches mitigate this cost by reducing the number of image tokens, either by discarding them after the visual encoder or by pruning them in the early Transformer layers of the LLM. While these strategies improve efficiency, they inevitably discard informative visual content and risk degrading downstream reasoning performance. To address this challenge, we introduce Hi-Lo Prune, a training-free pruning strategy built on a simple principle: look at what you will lose before you remove it. Instead of dropping tokens directly, Hi-Lo Prune first identifies which tokens are safe to prune through a coarse-to-fine selection process, and then encourages the model to absorb their information before pruning occurs. Specifically, the framework consists of three stages: (1) Hierarchical Pruning Token Selection: after visual encoding, a coarse-to-fine process identifies the tokens to retain and selects a critical set of pruning candidates from the redundant ones. (2) Attention-Guided Candidate Token Merge: before the selected tokens are removed, an attention mechanism in the early LLM layers explicitly transfers their information to the retained tokens. (3) Low-Informative Candidate Token Removal: at a designated Transformer layer, the candidate tokens are removed, reducing computation for all subsequent layers. This design enables aggressive early-layer pruning while preserving critical visual cues. Experiments on Qwen2-VL, Qwen2.5-VL, and Qwen3-VL demonstrate that Hi-Lo Prune consistently outperforms existing pruning methods across multiple benchmarks, achieving strong performance even under high pruning ratios without any fine-tuning. The code has been submitted as supplementary material and will be made publicly available.
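The three stages above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the importance scores, the keep/candidate ratios, and the dot-product merge rule are all illustrative assumptions standing in for the actual hierarchical selection and attention-guided merge.

```python
import numpy as np

def hi_lo_prune_sketch(tokens, scores, keep_ratio=0.5, cand_ratio=0.25):
    """Illustrative sketch of the three Hi-Lo Prune stages.

    tokens: (N, D) visual token features.
    scores: (N,) importance scores (assumed given; the paper derives
            them via a coarse-to-fine hierarchical selection).
    Returns the retained tokens, enriched with candidate information.
    """
    n, d = tokens.shape
    order = np.argsort(-scores)                 # most important first

    # Stage 1: hierarchical selection — split tokens into a retained set
    # and a critical subset of pruning candidates drawn from the redundant ones.
    n_keep = max(1, int(n * keep_ratio))
    keep_idx = order[:n_keep]
    redundant = order[n_keep:]
    n_cand = max(1, int(n * cand_ratio))
    cand_idx = redundant[:n_cand]

    retained, cands = tokens[keep_idx], tokens[cand_idx]

    # Stage 2: attention-guided merge — transfer candidate information
    # into the retained tokens before removal (softmax over dot products).
    attn = retained @ cands.T / np.sqrt(d)
    attn = np.exp(attn - attn.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    retained = retained + attn @ cands

    # Stage 3: removal — candidates (and the remaining redundant tokens)
    # are dropped; only the enriched retained tokens flow onward.
    return retained

# Toy usage: 16 tokens of dimension 8 reduced to 8 enriched tokens.
rng = np.random.default_rng(0)
out = hi_lo_prune_sketch(rng.normal(size=(16, 8)), rng.normal(size=16))
print(out.shape)  # (8, 8)
```

In the actual method these operations happen across the visual encoder and the early LLM layers rather than in a single function, but the data flow is the same: select, merge, then remove.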