Grounded Chain-of-Thought for Multimodal Large Language Models
Abstract
Despite great progress, existing multimodal large language models (MLLMs) still fall short in visual-spatial reasoning, which greatly impedes their trustworthy application in scenarios such as Embodied AI. To facilitate research in this direction, we propose a new MLLM task in this paper, called Grounded Chain-of-Thought (GCoT). Unlike recent visual CoT studies, which focus mainly on visual knowledge reasoning, GCoT aims to improve the visual-spatial reasoning capabilities of MLLMs by recognizing and grounding the relevant visual cues step by step, with step-wise grounding coordinates serving as an intuitive basis for the reasoning. To support this task, we carefully design and construct a benchmark called Multimodal Grounded Chain-of-Thought (MM-GCoT). We also introduce a comprehensive consistency evaluation system comprising three metrics: answer accuracy, grounding accuracy, and answer-grounding consistency. We further design and conduct a suite of experiments on 12 advanced MLLMs, which reveal some notable findings: i. most MLLMs perform poorly on the consistency evaluation, indicating obvious visual hallucination; ii. visual hallucination is not directly related to parameter size or general multimodal performance, i.e., a larger and stronger MLLM is not necessarily less affected by this issue.