Poster
HyperSeg: Hybrid Segmentation Assistant with Fine-grained Visual Perceiver
Cong Wei · Haoxian Tan · Yujie Zhong · Yong Liu · Jie Hu · Dengjie Li · Zheng Zhao · Yujiu Yang
This paper addresses universal segmentation for image and video perception, drawing on the strong reasoning ability of Visual Large Language Models (VLLMs). Despite significant progress, current unified segmentation methods adapt poorly to both image and video scenarios and to complex reasoning segmentation, which makes it difficult for them to handle challenging instructions and to accurately model fine-grained visual-text correlations. We propose HyperSeg, the first VLLM-based universal segmentation model for pixel-level image and video perception, covering both generic segmentation tasks and more complex reasoning perception tasks that demand strong reasoning abilities and world knowledge. To fully exploit the recognition capacity of VLLMs and fine-grained visual information, HyperSeg incorporates hybrid entity recognition and a fine-grained visual perceiver module for distinct segmentation tasks. Combined with a temporal adapter, HyperSeg achieves a comprehensive understanding of spatio-temporal information. Experimental results validate the effectiveness of our design on universal image and video segmentation tasks, including the more complex reasoning perception tasks.
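To make the two video-specific components concrete, below is a minimal PyTorch sketch of the two ideas the abstract names: a perceiver that cross-attends a small set of learned query tokens over high-resolution patch features (so fine-grained visual detail reaches the language model), and a temporal adapter that mixes information across frames. This is an illustration based only on the abstract, not the authors' implementation; the class names (`FineGrainedPerceiver`, `TemporalAdapter`), shapes, and hyperparameters are all assumptions.

```python
# Hedged sketch of the abstract's fine-grained perceiver + temporal adapter
# ideas. All names, shapes, and sizes are illustrative assumptions, not the
# HyperSeg code.
import torch
import torch.nn as nn

class FineGrainedPerceiver(nn.Module):
    """Cross-attends learned query tokens over high-resolution patch
    features so fine-grained visual detail is compressed into a small
    token set suitable for a VLLM."""
    def __init__(self, dim=256, num_queries=32, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, visual_feats):  # visual_feats: (B, N_patches, dim)
        b = visual_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        q = q + self.cross_attn(self.norm1(q), visual_feats, visual_feats)[0]
        return q + self.ffn(self.norm2(q))  # (B, num_queries, dim)

class TemporalAdapter(nn.Module):
    """Self-attention across frames so per-frame tokens become
    video-aware before segmentation decoding."""
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_tokens):  # frame_tokens: (B, T_frames, dim)
        x = self.norm(frame_tokens)
        return frame_tokens + self.attn(x, x, x)[0]

# Toy usage: 2 clips, 8 frames each, 1024 patch features per frame.
perceiver = FineGrainedPerceiver()
adapter = TemporalAdapter()
frames = torch.randn(2, 8, 1024, 256)
per_frame = torch.stack([perceiver(frames[:, t]) for t in range(8)], dim=1)
fused = adapter(per_frame.mean(dim=2))  # (2, 8, 256) video-aware frame tokens
print(fused.shape)
```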