Poster
The Devil is in Temporal Token: High Quality Video Reasoning Segmentation
Sitong Gong · Yunzhi Zhuge · Lu Zhang · Zongxin Yang · Pingping Zhang · Huchuan Lu
Existing methods for Video Reasoning Segmentation rely heavily on a single special token to represent the object in the keyframe or the entire video, which inadequately captures spatial complexity and inter-frame motion. To overcome these challenges, we propose VRS-HQ, an end-to-end video reasoning segmentation approach that leverages Multimodal Large Language Models (MLLMs) to inject rich spatiotemporal features into hierarchical tokens. Our key innovations are Temporal Dynamic Aggregation (TDA) and Token-driven Keyframe Selection (TKS). Specifically, we design frame-level