Skip to yearly menu bar Skip to main content


Poster

Mono3DVLT: Monocular-Video-Based 3D Visual Language Tracking

Hongkai Wei · YANG YANG · Shijie Sun · Mingtao Feng · Xiangyu Song · Qi Lei · Hongli Hu · Rong Wang · Huansheng Song · Naveed Akhtar · Ajmal Mian


Abstract:

Visual-Language Tracking (VLT) is emerging as a promising paradigm to bridge the human-machine performance gap. For single objects, VLT broadens the problem scope to text-driven video comprehension. Yet, this direction is still confined to 2D spatial extents, currently lacking the ability to deal with 3D tracking in the confines of monocular video. Unfortunately, advances in 3D tracking mainly rely on expensive sensor inputs, e.g., point clouds, depth measurements, radar. Absence of language counterpart for the outputs of these mildly democratized sensors in the literature also hinders VLT expansion to 3D tracking. Addressing that, we make the first attempt towards extending VLT to 3D tracking based on monocular video. We present a comprehensive framework, introducing (i) the Monocular-Video-based 3D Visual Language Tracking (Mono3DVLT) task, (ii) a large-scale dataset for the task, called Mono3DVLT-V2X, and (iii) a customized neural model for the task. Our dataset is carefully curated, leveraging a Large Langauge Model (LLM) followed by human verification, composing natural language descriptions for 79,158 video sequences aiming at single object tracking, providing 2D and 3D bounding box annotations. Our neural model, termed Mono3DVLT-MT, is the first targeted approach for the Mono3DVLT task. Comprising the pipeline of multi-modal feature extractor, visual-language encoder, tracking decoder and a tracking head, our model sets a strong baseline for the task on Mono3DVLT-V2X. Experimental results show that our method significantly outperforms existing techniques on the Mono3DVLT-V2X dataset. Our dataset and code are available in Supplementary Material to ease reproducibility.

Live content is unavailable. Log in and register to view live content