EEGiT: Teaching Vision Transformers to Understand EEG Signals
Abstract
Decoding visual stimuli from electroencephalography (EEG) signals is a crucial step toward practical brain–computer interfaces (BCIs), but the task requires large-scale, high-quality EEG–image paired datasets. Whereas image data are abundant, EEG recordings are scarce, which limits the performance of decoding models. To address this challenge, we propose EEGiT, a framework that converts sequential EEG signals into image-like EEG patches, enabling a pretrained Vision Transformer (ViT) to be used directly as the EEG encoder. To preserve the spatial topology of brain regions and minimize distributional differences across channels, we group EEG electrodes according to anatomical structure and apply linear interpolation along the spatial dimension. We then resample the EEG signals so that the resulting EEG patches match the structure of the image patches expected by the ViT. This design encourages effective transfer of visual priors learned from large-scale image datasets to EEG representation learning. Experiments on the THINGS-EEG and EEG-3D datasets show that fine-tuning pretrained ViTs improves EEG-to-image retrieval and EEG-based visual classification while maintaining robustness and strong cross-subject generalization. These results point to a promising direction for leveraging powerful vision models to mitigate data scarcity in EEG decoding.
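To make the conversion concrete, the sketch below illustrates one way the abstract's pipeline could be realized: channels are grouped by anatomical region, linearly interpolated along the spatial dimension, and resampled in time so the result aligns with a standard ViT patch grid. The region groups, the 224×224 target size, and the NumPy-only implementation are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of the EEG-to-patch conversion described above.
# Shapes, group definitions, and parameter values are illustrative assumptions.
import numpy as np

# Hypothetical anatomical grouping: channel indices per brain region.
REGION_GROUPS = {
    "frontal":   [0, 1, 2, 3],
    "central":   [4, 5, 6, 7],
    "parietal":  [8, 9, 10, 11],
    "occipital": [12, 13, 14, 15],
}

def eeg_to_image_like(eeg, out_height=224, out_width=224):
    """Convert an EEG trial of shape (n_channels, n_samples) into an
    (out_height, out_width) array whose 16x16 patch grid matches a
    standard ViT input. Channels are grouped by region, linearly
    interpolated along the spatial axis, and resampled in time."""
    rows_per_region = out_height // len(REGION_GROUPS)
    rows = []
    for idx in REGION_GROUPS.values():
        group = eeg[idx]                       # (n_group_channels, n_samples)
        # Linear interpolation along the spatial (channel) dimension.
        src = np.linspace(0.0, 1.0, group.shape[0])
        dst = np.linspace(0.0, 1.0, rows_per_region)
        spatial = np.stack(
            [np.interp(dst, src, group[:, t]) for t in range(group.shape[1])],
            axis=1,
        )                                      # (rows_per_region, n_samples)
        rows.append(spatial)
    img = np.concatenate(rows, axis=0)         # (out_height, n_samples)
    # Resample the temporal axis so the width matches the ViT patch grid.
    t_src = np.linspace(0.0, 1.0, img.shape[1])
    t_dst = np.linspace(0.0, 1.0, out_width)
    img = np.stack([np.interp(t_dst, t_src, row) for row in img], axis=0)
    return img                                 # (out_height, out_width)

# Example: 16 channels, 250 time samples -> a 224x224 image-like input
# that a pretrained ViT can split into its usual 16x16 patches.
trial = np.random.randn(16, 250)
print(eeg_to_image_like(trial).shape)  # (224, 224)
```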