C$^3$R: Cycle-Consistent Reinforcement Learning Improves Multimodal Reasoning
Zirui Zhang ⋅ Haoyu Dong ⋅ Kexin Pei ⋅ Chengzhi Mao
Abstract
Multimodal Large Language Models (MLLMs) suffer from a fundamental "modality gap": they contradict themselves when reasoning over visual versus textual views of the same content. This paper argues that this inconsistency is not merely a failure mode, but a powerful resource for self-rewarding multimodal learning. Instead of relying on flawed voting mechanisms that amplify systematic errors whenever the majority is wrong, we introduce Cross-modal Cycle Consistency as Reward (C$^3$R) to improve multimodal reasoning. C$^3$R performs backward inference from an answer to a query, switches modalities, and then performs forward inference to verify that the original answer is recovered. This cycle serves as a dense, label-free reward that guides the model to resolve its own internal conflicts while avoiding the majority-is-wrong failures of standard voting methods. On standard benchmarks, C$^3$R mitigates modality-specific biases and improves reasoning accuracy by up to 7.6%. Our results show that robust reasoning emerges not just from scaling data, but from achieving a bidirectional understanding of the multimodal world.
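To make the backward-then-forward cycle concrete, here is a minimal Python sketch of how such a reward could be computed. The `MLLM` interface, the `ask` method, and the `answers_agree` matcher are all hypothetical names for illustration; the abstract does not specify the paper's actual implementation.

```python
from typing import Optional, Protocol


class MLLM(Protocol):
    """Assumed interface for a multimodal LLM; hypothetical, not the paper's API."""
    def ask(self, prompt: str, image: Optional[object] = None) -> str: ...


def answers_agree(a: str, b: str) -> bool:
    # Loose string agreement; a real system would likely use a stronger
    # matcher (normalized exact match, or an LLM judge).
    return a.strip().lower() == b.strip().lower()


def cycle_consistency_reward(mllm: MLLM, image: object, answer: str) -> float:
    """Backward inference (answer -> query), modality switch, forward
    inference (query -> answer), then check whether the cycle closes."""
    # Backward: reconstruct a question that the candidate answer resolves.
    question = mllm.ask(f"Write a question whose answer is: {answer}")
    # Switch modality: re-answer the reconstructed question from the image
    # rather than the text view used in the original forward pass.
    revisited = mllm.ask(question, image=image)
    # Dense, label-free reward: 1.0 if the cycle reproduces the answer.
    return 1.0 if answers_agree(revisited, answer) else 0.0
```

In a reinforcement learning loop, this scalar could plausibly serve as the per-sample reward in a policy-gradient update, replacing the majority-vote signal the paper argues against.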