Poster
A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMs
Wangbo Zhao · Yizeng Han · Jiasheng Tang · Zhikai Li · Yibing Song · Kai Wang · Zhangyang Wang · Yang You
Vision-language models (VLMs) have shown remarkable success across various multi-modal tasks, yet large VLMs encounter significant efficiency challenges due to processing numerous visual tokens. A promising approach to accelerating large VLM inference is to use partial information, such as attention maps from specific layers, to assess token importance and prune less essential tokens. However, our study reveals three key insights: (i) Partial attention information is insufficient for accurately identifying critical visual tokens, resulting in suboptimal performance, especially at low token retention ratios; (ii) Global attention information, such as the attention map aggregated across all layers, more effectively preserves essential tokens and maintains performance under aggressive pruning. However, it requires a full inference pass, which increases computational load and is therefore impractical in existing methods; and (iii) The global attention map aggregated from a small VLM closely resembles that of a large VLM, suggesting an efficient alternative. Based on these findings, we introduce Small VLM Guidance for Large VLMs (SGL). Specifically, we employ the aggregated attention map from a small VLM to guide the pruning of visual tokens in a large VLM. Additionally, we develop a small VLM early-exiting mechanism that makes full use of the small VLM's predictions, dynamically invoking the larger VLM only when necessary, yielding a superior trade-off between accuracy and computational cost. Extensive evaluations across 11 benchmarks demonstrate the effectiveness and generalizability of our method, achieving up to a 91% pruning ratio for visual tokens while retaining competitive performance.
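To make the pipeline concrete, below is a minimal sketch of the two mechanisms the abstract describes: aggregating attention across all layers of a small VLM to rank visual tokens, pruning the large VLM's visual tokens accordingly, and exiting early when the small VLM is already confident. The model interfaces (`small_vlm`, `large_vlm`), tensor shapes, the mean-based aggregation, the keep ratio, and the confidence threshold are all assumptions for illustration; the actual SGL aggregation rule and exit criterion may differ.

```python
import torch

def aggregate_attention(attn_maps):
    """Aggregate text-to-visual attention over all layers and heads of the
    small VLM into one importance score per visual token.
    attn_maps: list (one per layer) of [heads, text_tokens, visual_tokens]."""
    stacked = torch.stack(attn_maps)        # [layers, heads, T, V]
    return stacked.mean(dim=(0, 1, 2))      # [V] per-token importance

def prune_visual_tokens(visual_tokens, importance, keep_ratio=0.09):
    """Keep the top-k visual tokens by small-VLM importance. A keep ratio of
    0.09 corresponds to the 91% pruning ratio reported in the abstract."""
    k = max(1, int(visual_tokens.shape[0] * keep_ratio))
    keep_idx = importance.topk(k).indices.sort().values  # preserve token order
    return visual_tokens[keep_idx]

def sgl_inference(image, text, small_vlm, large_vlm, conf_threshold=0.8):
    """Run the small VLM first; return its answer if confident enough,
    otherwise invoke the large VLM on the pruned visual tokens."""
    # Hypothetical API: the small VLM exposes logits, all-layer attention
    # maps, and its visual token embeddings in a single forward pass.
    logits, attn_maps, visual_tokens = small_vlm(image, text)
    probs = logits.softmax(dim=-1)
    if probs.max().item() >= conf_threshold:
        return probs.argmax()               # early exit: small VLM suffices
    importance = aggregate_attention(attn_maps)
    pruned = prune_visual_tokens(visual_tokens, importance)
    return large_vlm(pruned, text).argmax() # hypothetical large-VLM call
```

Because the ranking comes from the small VLM's already-computed forward pass, the guidance adds no extra inference over the large model; the large VLM only ever sees the retained token subset.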