Poster
PAVE: Patching and Adapting Video Large Language Models
Zhuoming Liu · Yiquan Li · Khoi D Nguyen · Yiwu Zhong · Yin Li
We present PAVE, a framework for adapting pre-trained video large language models to downstream tasks featuring temporal supplementary signals, such as audio, camera pose, or high-frame-rate videos. PAVE adapts these models through "patching", introducing a small number of additional parameters and operations without modifying the base model architecture or pre-trained weights. We demonstrate that PAVE effectively adapts video LLMs for tasks including audio-visual understanding and 3D reasoning, surpassing state-of-the-art task-specific models, while using less than 1% additional parameters and FLOPs. Furthermore, when applied to high-frame-rate videos, PAVE enhances video understanding, improving the performance of strong base models. Our analysis also highlights that this framework generalizes well across different video LLMs.
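The general idea of patching a frozen model can be illustrated with a toy sketch. This is not the PAVE implementation: the names `FrozenBase`, `Patch`, `patched_forward`, and `extra_param_fraction` are hypothetical, the "encoder" is a trivial stand-in, and the residual, zero-initialized form of the patch is an assumption chosen for illustration. It only shows that a small add-on can inject a side signal while the pre-trained weights stay untouched.

```python
# Illustrative sketch only; all names and the residual design are assumptions.
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FrozenBase:
    """Stand-in for a pre-trained model whose weights are never updated."""
    weights: tuple  # flattened pre-trained parameters (immutable)

    def encode(self, video_tokens):
        # Toy encoder: scale each token by the mean weight.
        scale = sum(self.weights) / len(self.weights)
        return [scale * t for t in video_tokens]


@dataclass
class Patch:
    """Small trainable add-on that injects a side signal (e.g. audio)."""
    # Zero initialization (an assumption): the patch starts as a no-op.
    gain: list = field(default_factory=lambda: [0.0])

    def __call__(self, base_features, side_signal):
        # Residual update: base features plus a scaled summary of the side signal.
        summary = sum(side_signal) / len(side_signal)
        return [f + self.gain[0] * summary for f in base_features]


def patched_forward(base, patch, video_tokens, side_signal):
    """Run the frozen base, then apply the patch on top of its output."""
    return patch(base.encode(video_tokens), side_signal)


def extra_param_fraction(base, patch):
    """Fraction of parameters added by the patch relative to the base."""
    return len(patch.gain) / len(base.weights)
```

With a zero-valued `gain`, `patched_forward` returns exactly what the base model would return alone; only the patch parameters would change during adaptation, which keeps the added parameter count small relative to the base.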