PointThinker: Point-Incentivized Parallel Thinking for Multimodal Large Language Models
Abstract
This paper explores parallel thinking for Multimodal Large Language Models (MLLMs), aiming to improve Chain-of-Thought (CoT) reasoning through multiple diverse reasoning paths. We guide the model to list multiple visual key points and to develop an independent reasoning path for each. We therefore term this method PointThinker, as each thinking path starts from a point. PointThinker offers two key advantages. (1) It amplifies the benefits of parallel thinking. While parallel thinking naturally benefits from multiple reasoning paths, explicitly listing key points further amplifies these benefits by eliminating redundancy and promoting path diversity, enabling the model to explore problems from more varied perspectives. (2) It introduces a novel dense (point-wise) reward for reinforcement learning. We observe that during parallel thinking some points are helpful while others are invalid, yet popular methods assign them identical rewards. We therefore propose allocating differentiated rewards to different points within the same chain-of-thought. This is implemented via a self-verification mechanism called Group Points Policy Optimization (GPPO), which combines rollout-level and point-level validation for reward assignment. On challenging benchmarks such as HallusionBench, PointThinker achieves 58.7% accuracy, improving both reasoning quality and answer accuracy. Experimental results demonstrate that parallel thinking with points improves performance, and that GPPO contributes further non-trivial gains.
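The point-wise reward idea described above can be illustrated with a minimal sketch. This is not the paper's actual GPPO implementation; the function name, the boolean per-point validation signal, and the equal-weight combination of rollout-level and point-level rewards are all illustrative assumptions.

```python
# Hypothetical sketch of dense point-wise reward assignment in the spirit of GPPO.
# Assumption: each point in a rollout has a binary validation outcome, and the
# final reward mixes rollout-level correctness with point-level validity.

def assign_point_rewards(rollout_correct, point_valid,
                         w_rollout=0.5, w_point=0.5):
    """Return one reward per point, so helpful and invalid points in the
    same chain-of-thought receive different rewards instead of sharing one."""
    r_rollout = 1.0 if rollout_correct else 0.0
    return [w_rollout * r_rollout + w_point * (1.0 if valid else 0.0)
            for valid in point_valid]

# A correct rollout whose second point failed point-level validation:
rewards = assign_point_rewards(rollout_correct=True,
                               point_valid=[True, False, True])
# -> [1.0, 0.5, 1.0]: the invalid point is rewarded less than the helpful ones.
```

In contrast, a rollout-level-only scheme would give all three points the same reward of 1.0, which is the uniformity the point-wise design avoids.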