Poster
M-LLM Based Video Frame Selection for Efficient Video Understanding
Kai Hu · Feng Gao · Xiaohan Nie · Peng Zhou · Son Dinh Tran · Tal Neiman · Lingyun Wang · Mubarak Shah · Raffay Hamid · Bing Yin · Trishul Chilimbi
Recent advances in Multi-Modal Large Language Models (M-LLMs) show promising results in video reasoning. Popular M-LLM frameworks usually apply naive uniform sampling to reduce the number of video frames fed into the M-LLM, particularly for long-context videos. However, uniform sampling can miss crucial context in certain parts of a video, leaving the downstream M-LLM without sufficient visual information to answer a question. To address this pain point, we propose a lightweight M-LLM-based frame selection method that adaptively selects frames more relevant to the user's query. The selected frames are then digested by a frozen downstream Video Large Language Model (Video-LLM) for visual reasoning and question answering. To train the proposed frame selector, we introduce two supervision signals: (i) a spatial signal, in which single-frame importance is scored by prompting an M-LLM; (ii) a temporal signal, in which multi-frame selection is performed by prompting an LLM using the captions of all frame candidates. Empirical results show that the proposed M-LLM video frame selector improves the performance of various downstream Video-LLMs across medium (ActivityNet, NExT-QA) and long (EgoSchema, LongVideoBench) context video question answering benchmarks.
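To make the pipeline concrete, the sketch below illustrates query-aware frame selection followed by a frozen Video-LLM, as described in the abstract. It is a minimal illustration only: the `score_frame` relevance scorer and `video_llm` answering function are hypothetical stand-ins, not the authors' implementation or training procedure.

```python
# Minimal sketch of query-aware frame selection before a frozen Video-LLM.
# `score_frame` and `video_llm` are hypothetical interfaces assumed for
# illustration; they do not correspond to the paper's actual components.

from typing import Callable, List, Sequence


def select_frames(
    frames: Sequence,                              # candidate frames (e.g. decoded images)
    query: str,                                    # user question about the video
    score_frame: Callable[[object, str], float],   # per-frame relevance scorer (assumed)
    k: int = 8,                                    # frame budget for the downstream model
) -> List:
    """Score every candidate frame against the query, keep the top-k,
    and return them in chronological order for the downstream Video-LLM."""
    scored = [(score_frame(f, query), i) for i, f in enumerate(frames)]
    top = sorted(scored, key=lambda s: s[0], reverse=True)[:k]
    keep = sorted(i for _, i in top)               # restore temporal order
    return [frames[i] for i in keep]


def answer(frames, query, score_frame, video_llm, k: int = 8) -> str:
    """Adaptive frame selection followed by a frozen Video-LLM for QA."""
    selected = select_frames(frames, query, score_frame, k=k)
    return video_llm(selected, query)
```

In this sketch, the only trainable component would be the scorer behind `score_frame`; the Video-LLM stays frozen, matching the setup described above.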