Poster
Apollo: An Exploration of Video Understanding in Large Multi-Modal Models
Orr Zohar · Xiaohan Wang · Yann Dubois · Nikhil Mehta · Tong Xiao · Philippe Hansen-Estruch · Licheng Yu · Xiaofang Wang · Felix Juefei-Xu · Ning Zhang · Serena Yeung · Xide Xia
Abstract:
Despite the rapid integration of video perception capabilities into Large Multi-modal Models (LMMs), what drives their video perception remains poorly understood. Consequently, many design decisions in this domain are made without proper justification or analysis. The high computational cost of training and evaluating such models, coupled with limited open research, hinders the development of video-LMMs. To address this, we present a comprehensive study of what effectively drives video understanding in LMMs. We begin by critically examining the primary contributors to the high computational requirements of video-LMM research and discover *Scaling Consistency*, whereby design and training decisions made on smaller models and datasets (up to a critical size) transfer effectively to larger models. Leveraging this insight, we explore many video-specific aspects of video-LMMs, including video sampling, architectures, data composition, training schedules, and more. Guided by these findings, we introduce **Apollo**, a state-of-the-art family of LMMs that achieves superior performance across model sizes. Our models efficiently process videos over an hour long, with the 3B-parameter variant outperforming most existing 7B models. **Apollo**-7B is state-of-the-art among 7B LMMs, scoring 70.9 on MLVU and 63.3 on Video-MME. Our code and models will be made available upon publication.
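As context for the "video sampling" design axis mentioned above, the sketch below illustrates two common frame-selection strategies such a study might compare: fixed-fps sampling versus uniform sampling under a fixed frame budget. This is a minimal, hypothetical illustration; the abstract does not specify Apollo's actual sampling scheme, and the function names (`sample_fps`, `sample_uniform`) are assumptions, not the authors' API.

```python
# Illustrative sketch only: two generic video frame-sampling strategies.
# Not Apollo's implementation; names and parameters are hypothetical.
import numpy as np


def sample_fps(num_frames: int, native_fps: float, target_fps: float) -> np.ndarray:
    """Select frame indices at a fixed temporal rate (frames per second)."""
    step = native_fps / target_fps
    return np.arange(0, num_frames, step).astype(int)


def sample_uniform(num_frames: int, budget: int) -> np.ndarray:
    """Select a fixed number of frames spread evenly over the whole clip."""
    return np.linspace(0, num_frames - 1, num=budget).astype(int)


# Example: a 1-hour video at 30 fps has 108,000 frames.
print(len(sample_fps(108_000, native_fps=30, target_fps=1)))  # 3600 frames, grows with duration
print(len(sample_uniform(108_000, budget=64)))                 # 64 frames, fixed regardless of duration
```

The trade-off being illustrated: fps sampling preserves temporal density but its cost grows with video length, while a fixed frame budget caps cost but stretches the sampling interval for longer videos.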