Masked-Diffusion Autoencoders for 3D Medical Vision Representation Learning
Abstract
Effective medical image analysis requires representations that capture both global anatomical structure and fine-grained tissue texture, yet current self-supervised approaches struggle to satisfy both requirements simultaneously. Invariance-based methods learn through augmentation consistency but face challenges in medical imaging, where common augmentations may discard diagnostically relevant intensity patterns. Masked image modeling approaches employ high masking ratios to enforce holistic reasoning, yet inherently limit exposure to fine-grained texture. Recent work in general-domain vision demonstrates that generative and semantic objectives can reinforce each other, yet this paradigm remains unexplored for 3D medical imaging. We introduce Masked-Diffusion Autoencoders (MDAE), a self-supervised framework that applies spatial masking and diffusion corruption concurrently, training the model on two complementary objectives: reconstructing masked regions for structural coherence and denoising visible regions for textural fidelity. This dual corruption enables the network to learn joint structure-texture representations within a unified time-conditioned objective. Evaluated on brain MRI across tumor classification, molecular marker detection, and dense segmentation benchmarks, MDAE consistently outperforms state-of-the-art baselines, with improvements most pronounced in cross-modal generalization tasks.
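As a rough illustration of the dual-corruption objective sketched above (not the paper's exact formulation; the symbols $x$, $M$, $\tilde{x}_t$, $\epsilon$, $\hat{x}_\theta$, $\epsilon_\theta$, and the weight $\lambda$ are assumptions introduced here for clarity), the time-conditioned loss could take a form such as:

\[
\mathcal{L}(\theta) \;=\; \mathbb{E}_{x,\, t,\, \epsilon,\, M}\Big[\,
\underbrace{\big\| M \odot \big(\hat{x}_\theta(\tilde{x}_t, t) - x\big) \big\|^2}_{\text{masked-region reconstruction}}
\;+\; \lambda\,
\underbrace{\big\| (1-M) \odot \big(\epsilon_\theta(\tilde{x}_t, t) - \epsilon\big) \big\|^2}_{\text{visible-region denoising}}
\Big],
\qquad
\tilde{x}_t = (1-M) \odot \big(\sqrt{\bar{\alpha}_t}\, x + \sqrt{1-\bar{\alpha}_t}\, \epsilon\big),
\]

where $M$ marks masked voxels and $\tilde{x}_t$ is the visible, diffusion-corrupted input at timestep $t$; the first term drives structural coherence and the second drives textural fidelity.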