Poster Sat, Jun 6, 2026 • 3:45 PM – 5:45 PM PDT ExHall A & F 244

Wavelet-based Frame Selection by Detecting Semantic Boundary for Long Video Understanding

Wang Chen ⋅ Yuhui Zeng ⋅ Yongdong Luo ⋅ Tianyu Xie ⋅ Luojun Lin ⋅ Jiayi Ji ⋅ Yan Zhang ⋅ Xiawu Zheng

Abstract

Frame selection is crucial when applying Large Vision-Language Models (LVLMs) to long videos, owing to high frame redundancy and limited context windows. Current methods typically select frames with high relevance to a given query, resulting in a disjointed set of frames that disregards the narrative structure of the video. In this paper, we introduce $\textbf{W}$avelet-based $\textbf{F}$rame $\textbf{S}$election by Detecting $\textbf{S}$emantic $\textbf{B}$oundary ($\textbf{WFS-SB}$), a training-free framework that presents a new perspective: effective video understanding hinges not only on high relevance but, more importantly, on capturing semantic shifts, the pivotal moments of narrative change that are essential to comprehending the holistic storyline of a video. However, directly detecting abrupt changes in the query-frame similarity signal is often unreliable due to high-frequency noise arising from model uncertainty and transient visual variations. To address this, we leverage the wavelet transform, which provides an ideal solution through its multi-resolution analysis in both time and frequency domains. By applying this transform, we decompose the noisy signal into multiple scales and extract a clean semantic change signal from the coarsest scale. We identify the local extrema of this signal as semantic boundaries, which segment the video into coherent clips. Building on this, WFS-SB comprises a two-stage strategy: first, adaptively allocating a frame budget to each clip based on a composite importance score; and second, within each clip, employing the Maximal Marginal Relevance approach to select a diverse yet relevant set of frames. Extensive experiments show that WFS-SB significantly boosts LVLM performance, e.g., improving accuracy by $\textbf{5.5\% on VideoMME, 9.5\% on MLVU, and 6.2\% on LongVideoBench}$, consistently outperforming state-of-the-art methods.
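The boundary-detection and Maximal Marginal Relevance steps described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: the Haar wavelet (whose coarse approximation equals block means up to a constant factor), the use of approximation rather than detail coefficients as the "clean" signal, the mapping of an extremum to the centre of its block, cosine similarity as the redundancy measure, and the trade-off weight `lam`. All function names (`haar_approximation`, `semantic_boundaries`, `split_into_clips`, `mmr_select`) are hypothetical. The adaptive budget allocation is omitted because the abstract does not specify the composite importance score.

```python
import math


def haar_approximation(signal, levels):
    """Coarse-scale Haar approximation, up to a constant scale factor.

    Each output value is the mean of a block of 2**levels samples
    (the last block may be shorter if the length is not a multiple).
    """
    block = 2 ** levels
    chunks = [signal[i:i + block] for i in range(0, len(signal), block)]
    return [sum(chunk) / len(chunk) for chunk in chunks]


def semantic_boundaries(similarity, levels):
    """Frame indices of local extrema of the coarse similarity signal."""
    coarse = haar_approximation(similarity, levels)
    block = 2 ** levels
    boundaries = []
    for i in range(1, len(coarse) - 1):
        left = coarse[i] - coarse[i - 1]
        right = coarse[i + 1] - coarse[i]
        if left * right < 0:  # slope changes sign: local max or min
            boundaries.append(i * block + block // 2)  # assumed: block centre
    return boundaries


def split_into_clips(num_frames, boundaries):
    """Turn boundary indices into half-open (start, end) clip ranges."""
    edges = [0] + list(boundaries) + [num_frames]
    return [(edges[i], edges[i + 1]) for i in range(len(edges) - 1)]


def cosine(u, v):
    """Cosine similarity, returning 0.0 for zero-length vectors."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def mmr_select(relevance, features, k, lam=0.7):
    """Greedy Maximal Marginal Relevance over the frames of one clip.

    relevance[i] is the query-frame similarity of frame i and
    features[i] its embedding; returns indices in selection order.
    """
    selected = []
    candidates = list(range(len(relevance)))
    while candidates and len(selected) < k:
        def score(i):
            redundancy = max(
                (cosine(features[i], features[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


if __name__ == "__main__":
    # Toy similarity signal with four plateaus of four frames each.
    sim = [0.1] * 4 + [0.9] * 4 + [0.2] * 4 + [0.8] * 4
    cuts = semantic_boundaries(sim, levels=2)
    print(cuts, split_into_clips(len(sim), cuts))
```

In practice a wavelet library and a smoother wavelet family would likely replace the hand-rolled block means; the sketch only shows how coarse-scale extrema turn a noisy similarity curve into clip boundaries, and how MMR then trades relevance against redundancy within a clip.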