Modeling the Brain’s Grammar: ROI-Guided fMRI Pretraining for Transferable and Interpretable Vision Decoding
Abstract
Recent advances in fMRI pretraining have significantly improved visual decoding accuracy by leveraging cross-subject neuroimaging datasets. A prevailing strategy aligns individual fMRI signals into a shared feature space using subject-specific adapters, followed by a shared decoder. However, this unstructured feature space overlooks the redundancy and functional correlations among voxels and fails to incorporate the brain's intrinsic functional architecture centered on regions of interest (ROIs). To address these limitations, we propose ROITok, an ROI-guided fMRI pretraining framework. Our method introduces Sparse ROI Context Fusion to learn ROI-level visual representations and to capture the functional synergy between ROIs from cross-subject data. Inspired by Matryoshka Representation Learning (MRL), we design an embedding compression scheme that prioritizes the most informative visual components first, with later tokens adding progressively finer but still useful details. ROITok achieves strong transfer learning performance on the NSD and GOD datasets, remains robust under high levels of additive noise, and offers better interpretability while enabling new applications, such as quantitative assessment of each brain region's contribution to decoding. Our analysis shows that ROI-based pretraining automatically learns the brain's visual hierarchy, and that different ROIs provide complementary context for decoding; combining them improves decoding robustness.
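To make the MRL-inspired compression concrete, the sketch below illustrates a generic Matryoshka-style objective in which nested prefixes of the embedding are each aligned with the visual target, so that the earliest dimensions must carry the most informative content. This is a minimal illustration under assumed names and dimensions (MatryoshkaHead, matryoshka_alignment_loss, the prefix sizes, and the cosine-alignment loss are all hypothetical choices, not the paper's exact scheme).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MatryoshkaHead(nn.Module):
    """Project fMRI token features to a full-width embedding whose nested
    prefixes (e.g. the first 64, 128, ... dimensions) are each trained to
    align with the visual target. Hypothetical illustration of an
    MRL-style head, not the paper's implementation."""

    def __init__(self, in_dim: int, out_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, fmri_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(fmri_feats)


def matryoshka_alignment_loss(pred: torch.Tensor,
                              target: torch.Tensor,
                              prefix_dims=(64, 128, 256, 512)) -> torch.Tensor:
    """Average a cosine-alignment loss over nested embedding prefixes.
    Shorter prefixes are forced to capture the coarse, most informative
    visual components; later dimensions add progressively finer detail."""
    losses = []
    for d in prefix_dims:
        p = F.normalize(pred[:, :d], dim=-1)
        t = F.normalize(target[:, :d], dim=-1)
        # 1 - cosine similarity on the d-dimensional prefix
        losses.append((1.0 - (p * t).sum(dim=-1)).mean())
    return torch.stack(losses).mean()
```

Under this kind of objective, truncating the embedding to a short prefix at inference time trades a small amount of accuracy for a much more compact representation, which is the practical benefit the compression scheme targets.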