Poster
Incorporating Dense Knowledge Alignment into Unified Multimodal Representation Models
Yuhao Cui · Xinxing Zu · Wenhua Zhang · Zhongzhou Zhao · Jinyang Gao
Leveraging Large Language Models (LLMs) for text representation has achieved significant success, but the use of Multimodal LLMs (MLLMs) for multimodal representation remains underexplored. Previous MLLM-based representation studies have focused primarily on unifying the embedding space while neglecting multimodal alignment; as a result, their cross-modal retrieval performance falls markedly behind that of the CLIP series of models. To address this, we 1) construct DeKon5M, a contrastive learning dataset enriched with dense multimodal knowledge that efficiently enhances multimodal alignment in representation tasks, and 2) design a framework for training unified representations on MLLMs. Building on this unified representation framework and the dense knowledge dataset DeKon5M, we develop DeKR, a dense knowledge representation model built on Qwen2VL. Extensive quantitative and qualitative experiments demonstrate that DeKR not only aligns text, image, video, and text-image combinations within a unified embedding space but also achieves cross-modal retrieval performance comparable to state-of-the-art (SoTA) CLIP series models. These results validate the effectiveness of our approach and provide new insights for multimodal representation research.
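The abstract does not detail the training objective, so as background only, the sketch below shows a symmetric InfoNCE contrastive loss over in-batch pairs of embeddings, the standard objective behind CLIP-style alignment. The function name `info_nce_loss`, the embedding dimension, and the temperature are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor,
                  target_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired embeddings.

    query_emb, target_emb: (batch, dim) tensors, e.g. pooled hidden
    states produced by an MLLM for a text query and its paired image
    or video (hypothetical setup, not the paper's exact pipeline).
    """
    q = F.normalize(query_emb, dim=-1)
    t = F.normalize(target_emb, dim=-1)
    logits = q @ t.T / temperature          # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)
    # In-batch negatives: each query matches only its own paired target.
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

# Example usage with random stand-in embeddings.
if __name__ == "__main__":
    text_emb = torch.randn(8, 1024)   # stand-in for dense-knowledge caption embeddings
    image_emb = torch.randn(8, 1024)  # stand-in for paired image embeddings
    print(info_nce_loss(text_emb, image_emb).item())
```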