GThinker: Towards General Multimodal Reasoning via Cue-Guided Rethinking
Abstract
Despite recent advances in multimodal reasoning, Multimodal Large Language Models (MLLMs) still struggle with complex tasks where initial visual perceptions can be misleading. This performance gap stems from a critical reasoning flaw we term Visual Inertia: while MLLMs excel at iterative reflection in textual contexts, they tend to commit uncritically to their initial visual interpretations and rarely revise them. To overcome this limitation, we introduce GThinker, an MLLM equipped with a novel adaptive visual rethinking capability. GThinker leverages Cue-Rethinking, a flexible reasoning pattern that not only grounds reasoning in visual cues but also strategically triggers a re-examination of these cues to resolve inconsistencies. To instill this capability, we propose a two-stage training framework: a pattern-guided cold start, enhanced by a judge-guided selection mechanism for learning from failure cases, followed by incentive reinforcement learning. We further curate the GThinker-11k dataset via an iterative multimodal annotation pipeline to support this training. Extensive experiments demonstrate that GThinker significantly mitigates visual inertia during reasoning, achieving a leading 81.5\% on M3CoT, a benchmark rich in such challenges, and surpassing the powerful O4-mini model. Furthermore, GThinker shows consistent improvements across a range of multimodal reasoning benchmarks, with an average gain of 2.1\%, showcasing the broad benefits of equipping MLLMs with the ability to rethink both what they see and how they think.