Poster
FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity
Hang Hua · Qing Liu · Lingzhi Zhang · Jing Shi · Soo Ye Kim · Zhifei Zhang · Yilin Wang · Jianming Zhang · Zhe Lin · Jiebo Luo
The advent of large Vision-Language Models (VLMs) has significantly advanced multimodal tasks, enabling more sophisticated and accurate integration of visual and textual information across various applications, including image and video captioning, visual question answering, and cross-modal retrieval. Despite their superior capabilities, VLMs still struggle with fine-grained compositional image region descriptions. Specifically, they have difficulty recognizing arbitrary segmentation masks as referential inputs, interpreting compositional aspect instructions for referencing, and precisely describing the compositional aspects of a region. However, compositionality—the ability to understand and generate novel combinations of known visual and textual components—is critical for facilitating coherent reasoning and understanding across modalities in VLMs. To address this issue, we propose OpenCompositionCap, a new dataset for multi-grained region compositional image captioning that distinguishes itself from prior work by introducing the new task of compositional aspect-aware regional image captioning. To support this task, we also introduce a new VLM, FineCaption. The empirical results demonstrate the effectiveness of our proposed model compared with other strong VLMs. In addition, we analyze the capabilities of current VLMs in recognizing various visual prompts for compositional region image captioning, highlighting areas for improvement in VLM design and training.
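To make the task interface concrete, the sketch below shows one plausible way a query in aspect-aware regional captioning could be structured: an image, an arbitrary (mask-shaped, not box-shaped) region selector, and an instruction naming the compositional aspect to describe. This is a minimal illustration only; the aspect vocabulary, prompt templates, and all names (RegionCaptionQuery, build_prompt) are hypothetical and not taken from the paper or the OpenCompositionCap dataset.

```python
from dataclasses import dataclass
from typing import Literal

import numpy as np

# Hypothetical aspect vocabulary; the actual OpenCompositionCap aspects may differ.
Aspect = Literal["attribute", "relation", "whole-region"]


@dataclass
class RegionCaptionQuery:
    """One query in aspect-aware regional captioning: an image, an arbitrary
    segmentation mask selecting the region, and the compositional aspect to
    describe."""
    image: np.ndarray  # H x W x 3, RGB
    mask: np.ndarray   # H x W, bool; an arbitrary free-form region, not a box
    aspect: Aspect     # which compositional aspect the caption should target


def build_prompt(query: RegionCaptionQuery) -> str:
    """Render the textual side of the query. In a real system the mask itself
    would be fed to the model as a dense visual prompt, not as text."""
    templates = {
        "attribute": "Describe the attributes (e.g., color, texture, state) of the masked region.",
        "relation": "Describe how the masked region relates to the surrounding objects.",
        "whole-region": "Give a detailed caption of the masked region.",
    }
    return templates[query.aspect]


if __name__ == "__main__":
    h, w = 480, 640
    image = np.zeros((h, w, 3), dtype=np.uint8)
    mask = np.zeros((h, w), dtype=bool)
    mask[100:200, 150:300] = True  # stand-in for a real segmentation mask
    query = RegionCaptionQuery(image=image, mask=mask, aspect="attribute")
    print(build_prompt(query))
```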