KAMP: Knowledge-Anchored Multimodal Pretraining Framework for Medical Image Representation
Abstract
Cross-modal biomedical signals such as pathology and genomics can provide richer and more robust semantic guidance for medical image representation. In practice, however, such guidance remains limited, because privacy constraints and acquisition costs severely restrict the availability of medical images paired with other biomedical data. A further challenge is modality discrepancy, which propagates intra-modal statistical bias and cross-modal noise and degrades medical image representation quality. To address these challenges, we propose KAMP, a large language model (LLM)-driven, multimodally guided pretraining framework for medical image representation learning. KAMP leverages textual priors as semantic anchors to enhance medical image representations and to align medical images with multimodal biomedical data, enabling rich and robust representations even when paired data are scarce. KAMP operates in three stages. First, the LLM generates personalized diagnostic knowledge from patient clinical text and imaging metadata. We inject this knowledge as a prior to enrich the medical image representation and use it as a semantic anchor that reduces the distance between medical image representations and those of other biomedical modalities. Second, the LLM is optimized via the Group Relative Policy Optimization (GRPO) strategy, with the cross-modal aligner pretrained in the first stage serving as the reward model. Third, the refined knowledge is used to retrain the cross-modal aligner, yielding more robust medical image representations while mitigating the bias and noise introduced by other modalities. Comprehensive evaluations on brain, bladder, and liver cancer datasets demonstrate that KAMP consistently outperforms existing methods on downstream few-shot prediction tasks.
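To illustrate the knowledge-anchoring idea in the first stage, the following is a minimal sketch of how an LLM-generated knowledge embedding could serve as a shared semantic anchor between image and other-modality embeddings. It assumes an InfoNCE-style contrastive formulation; the function names, loss composition, and temperature are illustrative assumptions rather than the authors' actual objective.

```python
import torch
import torch.nn.functional as F

def knowledge_anchored_alignment_loss(img_emb, other_emb, knowledge_emb, temperature=0.07):
    """Hypothetical knowledge-anchored alignment loss (illustrative only).

    img_emb:       (B, D) medical image embeddings
    other_emb:     (B, D) embeddings of the paired biomedical modality (e.g., pathology, genomics)
    knowledge_emb: (B, D) text embeddings of LLM-generated diagnostic knowledge

    The knowledge embedding acts as a semantic anchor: both the image and the
    other-modality embeddings are pulled toward it, which indirectly shrinks the
    image-to-other-modality gap even when paired samples are scarce.
    """
    img = F.normalize(img_emb, dim=-1)
    oth = F.normalize(other_emb, dim=-1)
    kno = F.normalize(knowledge_emb, dim=-1)

    def info_nce(a, b):
        # Symmetric-free InfoNCE: matching pairs lie on the diagonal of the similarity matrix.
        logits = a @ b.t() / temperature
        targets = torch.arange(a.size(0), device=a.device)
        return F.cross_entropy(logits, targets)

    # Anchor each modality to the knowledge text, plus a direct image/other-modality term.
    return info_nce(img, kno) + info_nce(oth, kno) + info_nce(img, oth)
```

Under this reading, the same anchor term also offers a natural reward signal for the second stage: knowledge that better aligns the two modalities would score higher under the pretrained aligner used as the GRPO reward model.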