From Pixel to Precision: Enhancing Handwritten Mathematical Expression Recognition with Image-Level Reward
Ze Liu ⋅ Kai Zhang ⋅ Xianquan Wang ⋅ Shuochen Liu ⋅ Jiaxian Yan ⋅ Yupeng Han ⋅ Qi Liu
Abstract
Handwritten mathematical expression recognition is hindered by a fundamental misalignment between the dual representations of LaTeX formulas: the symbolic text and the rendered visual image. Because of this discrepancy, textually distinct LaTeX sequences can produce visually identical outputs, while minor textual errors can cause catastrophic rendering failures. As a result, text-level reward mechanisms cannot reliably assess the quality of model predictions and thus fail to guide the model toward optimal performance during training. To overcome this limitation, we introduce the Image Matching Score (IMS), a lightweight yet effective reward based on the structural edit distance of column-wise image projections, which robustly quantifies the visual fidelity between rendered formulas. Leveraging IMS, we then propose Image-Matching driven Policy Optimization (IMPO), a training framework built upon Group Relative Policy Optimization (GRPO). This approach enables stable policy learning directly from our sequence-level visual reward, notably without the need for a separate value function network. Extensive experiments demonstrate that IMPO yields consistent performance gains across various backbone models on the challenging CROHME, HME100K, and M$^2$E datasets. Our model-agnostic framework establishes new state-of-the-art results, improving the Expression Recognition Rate by an average of 1.1% and by up to 1.37% over strong prior methods. The code can be found in the supplementary materials.