Hugging Visual Prompt and Segmentation Tokens: Consistency Learning for Fine-Grained Visual Understanding in MLLMs
Jing Yang ⋅ Sen Yang ⋅ Boqiang Duan ⋅ Ming Dai ⋅ Wei Zhang ⋅ Xiao Tan ⋅ Kunbin Chen ⋅ Wei He ⋅ Jingdong Wang ⋅ Hanli Wang
Abstract
Recently, multimodal large language models (MLLMs) have achieved remarkable success in general multimodal tasks. Increasing attention has been given to leveraging MLLMs for fine-grained visual understanding, such as region-level captioning and pixel-level grounding. However, most existing approaches are task-specific, and although some recent unified approaches attempt to handle both task types simultaneously, they still fall short of deeply exploring the underlying associations across tasks. To bridge this gap, we propose a multimodal large language model designed to jointly support $\textbf{Fine-grained}$ visual understanding through $\textbf{Consistency Learning}$ (FCLM). The central idea of this work is that region-level captioning and pixel-level grounding are mutually beneficial and complementary tasks, each enhancing the other in achieving a fine-grained understanding of visual content. Specifically, FCLM analyzes the representation features required by the two types of visual tasks, namely visual prompt and segmentation tokens, and achieves advanced reasoning and perception through a newly designed consistency learning loss and a two-stage training framework. Moreover, we design a Hybrid Region Extractor to enhance the quality of visual prompt embeddings, thereby obtaining more semantically discriminative representations for detailed caption generation. Additionally, to verify the MLLM's ability to accurately localize targets from detailed textual descriptions, we introduce a novel task called Detailed Localized Referring Expression Segmentation (DL-RES). We conduct extensive experiments on seven visual understanding tasks, demonstrating the strong performance and generalization ability of FCLM.
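The abstract describes a consistency learning loss that ties the visual prompt tokens (used for region-level captioning) to the segmentation tokens (used for pixel-level grounding). The exact formulation is not given here, so the snippet below is only a minimal illustrative sketch: it assumes the two token groups are pooled into region-level embeddings and pulled together with a cosine-similarity objective. The function name, pooling choice, and tensor shapes are hypothetical, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F


def consistency_loss(visual_prompt_tokens: torch.Tensor,
                     segmentation_tokens: torch.Tensor) -> torch.Tensor:
    """Encourage the two token representations of the same region to agree.

    visual_prompt_tokens: (B, N_vp, D) hidden states of the visual prompt tokens
    segmentation_tokens:  (B, N_seg, D) hidden states of the segmentation tokens
    """
    # Pool each token group into a single region-level embedding (one simple choice).
    vp = F.normalize(visual_prompt_tokens.mean(dim=1), dim=-1)   # (B, D)
    seg = F.normalize(segmentation_tokens.mean(dim=1), dim=-1)   # (B, D)
    # Cosine-similarity consistency: loss is 0 when the paired embeddings align.
    return (1.0 - (vp * seg).sum(dim=-1)).mean()


# Example usage with dummy hidden states.
if __name__ == "__main__":
    vp_tokens = torch.randn(2, 4, 256)   # e.g. 4 visual prompt tokens per sample
    seg_tokens = torch.randn(2, 1, 256)  # e.g. a single segmentation token per sample
    print(consistency_loss(vp_tokens, seg_tokens))
```

In practice such a term would be added to the captioning and grounding losses during the joint training stage; the paper's two-stage framework and the precise weighting are not reproduced here.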