Efficient Video Object Segmentation and Tracking with Recurrent Dynamic Submodel
Abstract
The large vision foundation model SAM2 has achieved remarkable performance in video object segmentation and tracking (VOST). However, its effectiveness is hindered by significant computational overhead. While model pruning is a widely used strategy to address this issue, traditional static, input-agnostic pruning approaches fall short in managing the diverse and complex nature of video data. A promising alternative is dynamic networks, yet they often struggle to translate theoretical computational reductions into actual acceleration. Furthermore, both static and dynamic approaches typically focus on the visual features of individual frames while neglecting the temporal correlations between them, limiting their performance on complex video streams. To address these challenges, we propose the Recurrent Dynamic Submodel (RDS), a dynamic architecture that adaptively selects submodel blocks for each frame. Specifically, it employs a lightweight Prediction-aware Router (PAR) that leverages both the segmentation mask from the previous frame and the visual features of the current frame to make routing decisions, enabling the submodel to explicitly capture the temporal nature of video data. Additionally, to reduce the cost of adapting the dynamic submodel, we introduce an Importance-aware LoRA (I-LoRA), which tunes parameters only in the most critical blocks. Extensive experiments on various benchmarks demonstrate the effectiveness of our approach. For example, it achieves a 1.3× speedup on the DAVIS 2017 dataset with less than 1% performance degradation, while introducing only 3% (6.7M) trainable parameters and requiring only 0.003% (6.7k samples) of the SAM2 training data.
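To make the routing idea concrete, the following is a minimal sketch of how a prediction-aware router could gate backbone blocks per frame. All names, dimensions, and the single-linear-layer router are illustrative assumptions, not the paper's actual implementation: the router consumes a pooled current-frame feature together with an embedding of the previous frame's mask, and emits an independent keep/skip decision for each block.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_BLOCKS = 12   # hypothetical backbone depth (assumption)
FEAT_DIM = 64     # pooled current-frame feature size (assumption)
MASK_DIM = 16     # embedding size of the previous-frame mask (assumption)

# Hypothetical router parameters: one linear layer mapping the
# concatenated (frame feature, mask embedding) vector to a logit per block.
W = rng.standard_normal((FEAT_DIM + MASK_DIM, NUM_BLOCKS)) * 0.1
b = np.zeros(NUM_BLOCKS)

def route(frame_feat, prev_mask_emb, threshold=0.5):
    """Return a boolean keep/skip decision for each backbone block."""
    x = np.concatenate([frame_feat, prev_mask_emb])
    logits = x @ W + b
    probs = 1.0 / (1.0 + np.exp(-logits))  # sigmoid gate per block
    return probs > threshold                # skipped blocks are not executed

# One routing step for a single frame.
frame_feat = rng.standard_normal(FEAT_DIM)
prev_mask_emb = rng.standard_normal(MASK_DIM)
keep = route(frame_feat, prev_mask_emb)
print(f"executing {keep.sum()}/{NUM_BLOCKS} blocks for this frame")
```

Because the previous mask enters the routing input, the selected submodel varies with the tracked object's state rather than with per-frame appearance alone, which is the temporal coupling the abstract emphasizes.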