Poster
BACON: Improving Clarity of Image Captions via Bag-of-Concept Graphs
Zhantao Yang · Ruili Feng · Keyu Yan · Huangji Wang · Zhicai Wang · Shangwen Zhu · Han Zhang · Jie Xiao · Pingyu Wu · Kai Zhu · Jixuan Chen · Chen-Wei Xie · Yue Yang · Hongyang Zhang · Yu Liu · Fan Cheng
Advances in large Vision-Language Models (VLMs) have brought precise image captioning, vital for advancing multi-modal image understanding and processing. Yet these captions often carry lengthy, intertwined contexts that are difficult to parse and frequently overlook essential cues, posing a significant barrier for models such as GroundingDINO and SDXL, which lack the strong text encoding and syntax analysis needed to fully leverage dense captions. To address this, we propose BACON, a prompting method that breaks down VLM-generated captions into disentangled, structured elements such as objects, relationships, styles, and themes. This approach not only minimizes confusion from handling complex contexts but also allows efficient transfer into a JSON dictionary, enabling models without linguistic processing capabilities to easily access key information. We annotated 100,000 image-caption pairs using BACON with GPT-4V and trained an LLaVA captioner on this dataset, enabling it to produce BACON-style captions without relying on costly GPT-4V resources. Evaluations of overall quality, precision, and recall, as well as user studies, demonstrate that the resulting caption model consistently outperforms other state-of-the-art VLMs in generating high-quality captions. Additionally, we show that BACON-style captions exhibit better clarity when applied to various models, enabling them to accomplish previously unattainable tasks or surpass existing state-of-the-art solutions without training. For example, BACON-style captions help GroundingDINO achieve 1.51 times the recall of leading methods on open-vocabulary object detection.
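To make the structured format concrete, below is a minimal sketch of what a BACON-style caption could look like once transferred into a JSON dictionary. The field names and example values are illustrative assumptions based on the element types named in the abstract (objects, relationships, styles, themes), not the paper's exact schema.

```python
import json

# Illustrative sketch of a BACON-style structured caption.
# NOTE: the keys below ("style", "theme", "objects", "relationships")
# are assumptions inferred from the abstract, not the paper's exact schema.
bacon_caption = {
    "style": "photorealistic",
    "theme": "a dog chasing a ball in a park",
    "objects": [
        {"name": "dog", "attributes": ["brown", "running"]},
        {"name": "ball", "attributes": ["red"]},
        {"name": "park", "attributes": ["grassy"]},
    ],
    "relationships": [
        {"subject": "dog", "predicate": "chasing", "object": "ball"},
    ],
}

# Because the caption is a plain JSON dictionary, a downstream model without
# strong text encoding (e.g., an open-vocabulary detector) can read key cues
# directly instead of parsing a long free-form paragraph.
serialized = json.dumps(bacon_caption, indent=2)
object_names = [obj["name"] for obj in bacon_caption["objects"]]
print(object_names)  # ['dog', 'ball', 'park']
```

Under this kind of schema, a detector would only need the `objects` list as candidate labels, which is the disentanglement the abstract credits for the open-vocabulary detection gains.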