RNED: Rotary Number Encoding and Decoding for Quantitative Medical VLM Analysis
Abstract
Vision-Language Models (VLMs) are increasingly adopted for medical applications, but their clinical utility is limited by a core weakness in quantitative reasoning. This limitation affects tasks ranging from regression of lesion sizes to prediction of bounding-box coordinates, and it stems from the discrete tokenization schemes underlying Large Language Models (LLMs). To address this, we propose \emph{Rotary Number Encoding and Decoding} (RNED), a principled method for embedding continuous numerical values directly into the representation space of a VLM. In analogy with rotary position encoding, RNED represents a scalar by applying a number-specific rotation matrix to a dedicated numeric token embedding. This norm-preserving transformation maintains ordinal structure over a wide numerical range and integrates seamlessly with pretrained model weights. For decoding, we introduce a robust score-matching–based scheme that recovers continuous values from hidden states in the presence of stochastic noise. We evaluate RNED on two quantitative tasks: radiological measurement estimation and medical visual grounding. On both internal and public benchmarks, RNED consistently outperforms existing VLM baselines. Together, these results show that RNED offers a robust, generalizable solution for numerical reasoning in medical VLMs, enabling models that are both quantitatively reliable and clinically applicable. We will release code for experiments on public datasets.
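To make the encoding step concrete, the following sketch illustrates a RoPE-style, value-specific rotation applied to a numeric token embedding. This is a hypothetical NumPy illustration of the general idea, not the authors' released implementation; the function name, the frequency base, and the embedding dimension are assumptions.

```python
import numpy as np

def rotary_encode(value, embedding, base=10000.0):
    """Apply a value-specific rotation to an embedding (RoPE-style sketch).

    Pairs of embedding dimensions are rotated by angles proportional to
    `value` at geometrically spaced frequencies. The map is norm-preserving,
    so nearby values yield nearby encodings over a wide numerical range.
    Illustrative only; details of RNED may differ.
    """
    d = embedding.shape[-1]
    assert d % 2 == 0, "embedding dimension must be even"
    # One frequency per dimension pair, as in rotary position encoding.
    inv_freq = base ** (-np.arange(0, d, 2) / d)
    theta = value * inv_freq
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = embedding[..., 0::2], embedding[..., 1::2]
    out = np.empty_like(embedding)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
e = rng.standard_normal(64)      # a dedicated numeric token embedding (assumed size)
enc = rotary_encode(42.5, e)     # encode the scalar 42.5
print(np.isclose(np.linalg.norm(enc), np.linalg.norm(e)))  # norm-preserving
```

Because rotation is invertible, a decoder can in principle recover the scalar by searching for the angle that best aligns the hidden state with the base embedding, which is the intuition behind the score-matching decoding step.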