Poster
DTOS: Dynamic Time Object Sensing with Multimodal Large Language Model
Jirui Tian · Jinrong Zhang · Shenglan Liu · Luhao Xu · Zhixiong Huang · Gao Huang
Existing multimodal large language models (MLLM) face significant challenges in Referring Video Object Segmentation(RVOS). We identify three critical challenges: (C1) insufficient quantitative representation of textual numerical data, (C2) repetitive and degraded response templates for spatiotemporal referencing, and (C3) loss of visual information in video sampling queries lacking textual guidance. To address these, we propose a novel framework, \textbf{Dynamic Time Object Sensing (DTOS)}, specifically designed for RVOS. To tackle (C1) and (C2), we introduce specialized tokens to construct multi-answer response templates, enabling regression of event boundaries and target localization. This approach improves the accuracy of numerical regression while mitigating the issue of repetitive degradation. To address (C3), we propose a Text-guided Clip Sampler (TCS) that selects video clips aligned with user instructions, preventing visual information loss and ensuring consistent temporal resolution. TCS is also applicable to Moment Retrieval tasks, with enhanced multimodal input sequences preserving spatial details and maximizing temporal resolution. DTOS demonstrates exceptional capability in flexibly localizing multiple spatiotemporal targets based on user-provided textual instructions. Extensive experiments validate the effectiveness of our approach, with DTOS achieving state-of-the-art performance in J&F scores: an improvement of +4.36 on MeViS, +4.48 on Ref-DAVIS17, and +3.02 on Ref-YT-VOS. Additionally, our TCS demonstrates exceptional performance in Moment Retrieval. All code, models, and datasets will be made publicly available.
Live content is unavailable. Log in and register to view live content