Poster

GroundingFace: Fine-grained Face Understanding via Pixel Grounding Multimodal Large Language Model

Yue Han · Jiangning Zhang · Junwei Zhu · Runze Hou · Xiaozhong Ji · Chuming Lin · Xiaobin Hu · Xuezhucun Xue · Yong Liu


Abstract:

Multimodal Large Language Models (MLLMs) have shown remarkable performance in image understanding, generation, and editing, with recent advances achieving pixel-level grounding with reasoning. However, these models, designed for common objects, struggle with fine-grained face understanding. In this work, we introduce the FacePlayGround-240K dataset, the first large-scale, pixel-grounded face caption and question-answering (QA) dataset, meticulously curated for alignment pretraining and instruction tuning. We present the GroundingFace framework, specifically designed to enhance fine-grained face understanding. This framework significantly augments the capabilities of existing grounding models in face part segmentation and face attribute comprehension while preserving general scene understanding. Comprehensive experiments validate that our approach surpasses current state-of-the-art models in pixel-grounded face captioning/QA, as well as in various downstream tasks, including face captioning, referring segmentation, and zero-shot face attribute recognition.