ReLaX: Reasoning with Latent Exploration for Large Reasoning Models
Shimin Zhang ⋅ Xianwei Chen ⋅ Yufan Shen ⋅ Ziyuan Ye ⋅ Jibin Wu
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated remarkable potential in enhancing the reasoning capabilities of Large Reasoning Models (LRMs). However, RLVR often leads to premature policy convergence, resulting in early exploitation and performance saturation. While manipulating token-level entropy has proven effective for promoting early exploration, we argue that the latent dynamics underlying token generation provide a richer computational structure for guiding policy optimization. To characterize the nonlinear latent structure of LRMs, and to facilitate measurement and manipulation in a tractable representation space, we leverage Koopman operator theory to linearize the hidden-state dynamics. We then introduce a new metric, $\textbf{D}$ynamic $\textbf{S}$pectral $\textbf{D}$ispersion ($\textbf{DSD}$), to quantify the diversity of the model's reasoning dynamics, which also serves as a direct measure of the degree of exploration. Building upon these foundations, we introduce a latent-dynamics-aware training paradigm, $\textbf{Re}$asoning with $\textbf{La}$tent e$\textbf{X}$ploration ($\textbf{ReLaX}$), to achieve a better balance between exploration and exploitation during policy optimization. With the proposed ReLaX, we achieve state-of-the-art results across $7$ multimodal and multidisciplinary reasoning benchmarks. Furthermore, comparative analysis reveals that ReLaX's mechanism of adaptive, semantically meaningful exploration cultivates more structured and robust reasoning than methods that merely optimize token-level entropy.
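For intuition, a DSD-style quantity can be sketched by fitting a linear (Koopman-approximating) operator to consecutive hidden states along a decoding trajectory via least squares, in the spirit of dynamic mode decomposition, and then measuring the spread of that operator's eigenvalues. The snippet below is a minimal illustration under these assumptions; the function name, the pseudoinverse-based fit, and the standard-deviation dispersion measure are illustrative stand-ins, not the exact formulation used in this work.

```python
import numpy as np

def dynamic_spectral_dispersion(hidden_states: np.ndarray) -> float:
    """Illustrative DSD-style metric (hypothetical implementation).

    hidden_states: array of shape (T, d), one hidden state per
    generated token along a single reasoning trajectory.
    """
    # Snapshot pairs: columns of Y should satisfy Y ≈ K X for a
    # linear operator K approximating the hidden-state dynamics.
    X = hidden_states[:-1].T  # (d, T-1)
    Y = hidden_states[1:].T   # (d, T-1)

    # DMD-style least-squares fit of the Koopman-approximating operator.
    K = Y @ np.linalg.pinv(X)

    # Spectrum of the fitted operator; its spread in the complex plane
    # serves here as a proxy for the diversity of the dynamics.
    eigvals = np.linalg.eigvals(K)
    return float(np.sqrt(np.mean(np.abs(eigvals - eigvals.mean()) ** 2)))
```

Under this reading, a policy whose trajectories all collapse onto a few dominant modes yields a tightly clustered spectrum (low dispersion, low exploration), whereas more varied reasoning dynamics spread the eigenvalues out and raise the score.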