Illuminating Visual Identity in Universal Multimodal Embeddings
Jiawei Cao ⋅ Junyi Feng ⋅ Jiashen Hua ⋅ Ziheng Huang ⋅ Bing Deng ⋅ Kaijie Wu ⋅ Chaochen Gu ⋅ Jieping Ye
Abstract
Universal Multimodal Embeddings (UMEs) aim to unify various modalities and tasks into a shared representation space. In recent years, this field has witnessed substantial progress driven by the development of Multimodal Large Language Models (MLLMs). However, a crucial capability, visual identity discrimination, remains underexplored in existing UME methods, despite its critical role in a wide range of tasks, including instance retrieval, re-identification, and identity preservation in AI-generated content (AIGC). To bridge this gap, we propose a unified formulation for visual identity discrimination and introduce $\textbf{MIEB}$ ($\textbf{M}$ultimodal Visual $\textbf{I}$dentity $\textbf{E}$mbedding $\textbf{B}$enchmark), a large-scale benchmark curated from both real-world and synthetic datasets to support evaluation and training. Furthermore, we present a simple yet effective learning framework that jointly optimizes general multimodal and visual identity representations through a carefully designed identity-aware sampling mechanism. Extensive experiments demonstrate that our approach endows UMEs with strong identity discrimination capability while maintaining competitive general multimodal performance. We believe this work not only illuminates a critical yet neglected capability, but also takes a step toward more holistic universal multimodal embeddings.
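The abstract does not specify how the identity-aware sampling mechanism is constructed. As a rough illustration of the general idea, the sketch below shows one common way such a sampler can be built for contrastive training: each batch draws several distinct identities and multiple views per identity, so that same-identity pairs serve as positives and different identities as in-batch negatives. The class name `IdentityAwareSampler` and the parameters `identities_per_batch` and `views_per_identity` are hypothetical choices for this sketch, not the paper's actual design.

```python
import random
from collections import defaultdict

class IdentityAwareSampler:
    """Illustrative batch sampler (not the paper's method): groups sample
    indices so each batch holds several identities with multiple views each,
    yielding in-batch positives (same identity) and negatives (different
    identities) for a contrastive objective."""

    def __init__(self, identity_labels, identities_per_batch=8,
                 views_per_identity=4, seed=0):
        # identity_labels[i] is the identity id of dataset sample i.
        self.index_by_identity = defaultdict(list)
        for idx, ident in enumerate(identity_labels):
            self.index_by_identity[ident].append(idx)
        # Keep only identities with enough views to form positive pairs.
        self.identities = [k for k, v in self.index_by_identity.items()
                           if len(v) >= views_per_identity]
        self.identities_per_batch = identities_per_batch
        self.views_per_identity = views_per_identity
        self.rng = random.Random(seed)

    def __iter__(self):
        idents = self.identities[:]
        self.rng.shuffle(idents)
        # Emit only full batches of identities_per_batch identities.
        for start in range(0, len(idents) - self.identities_per_batch + 1,
                           self.identities_per_batch):
            batch = []
            for ident in idents[start:start + self.identities_per_batch]:
                batch.extend(self.rng.sample(self.index_by_identity[ident],
                                             self.views_per_identity))
            yield batch

# Usage: 20 toy identities with 5 views each; every batch is
# 8 identities x 4 views = 32 sample indices.
labels = [i // 5 for i in range(100)]
first_batch = next(iter(IdentityAwareSampler(labels)))
```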