H-ViT: A Hierarchical Vision Transformer for Deformable Image Registration

Morteza Ghahremani · Mohammad Khateri · Bailiang Jian · Benedikt Wiestler · Ehsan Adeli · Christian Wachinger

Arch 4A-E Poster #181
award Highlight
Thu 20 Jun 10:30 a.m. PDT — noon PDT


This paper introduces a novel top-down representation approach for deformable image registration, which estimates the deformation field by capturing short- and long-range flow features at different scale levels. In our Hierarchical Vision Transformer (H-ViT), we propose a dual self-attention and cross-attention mechanism that uses high-level features in the deformation field to represent low-level ones, enabling information flow across all voxel patch embeddings irrespective of their spatial proximity. Since high-level features contain abstract flow patterns, such patterns are expected to contribute positively to the representation of the deformation field at lower scales. While the self-attention module captures within-scale short-range patterns, the cross-attention modules dynamically look for key tokens across different scales to further interact with the local query voxel patches. Our method shows superior accuracy and visual quality in deformable image registration over the state of the art on five publicly available datasets, highlighting a substantial enhancement in the performance of medical image registration. The code and pre-trained models are available at
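The cross-scale interaction described above, where low-level (fine) voxel-patch embeddings act as queries against high-level (coarse) tokens serving as keys and values, can be sketched as standard scaled dot-product cross-attention. This is a minimal illustration, not the paper's actual H-ViT implementation; the function and variable names (`cross_scale_attention`, `fine`, `coarse`) and the toy dimensions are assumptions for demonstration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_scale_attention(queries, keys, values):
    """Fine-scale tokens (queries) attend to coarse-scale tokens
    (keys/values), letting abstract high-level flow patterns inform
    the low-level deformation-field representation."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)   # (n_fine, n_coarse)
    weights = softmax(scores, axis=-1)       # rows sum to 1
    return weights @ values                  # (n_fine, d)

# toy example: 16 fine-scale patch tokens query 4 coarse-scale tokens
rng = np.random.default_rng(0)
fine = rng.normal(size=(16, 8))    # hypothetical low-level (query) embeddings
coarse = rng.normal(size=(4, 8))   # hypothetical high-level (key/value) tokens
out = cross_scale_attention(fine, coarse, coarse)
print(out.shape)  # (16, 8)
```

Because the attention weights span all coarse tokens, every fine-scale patch can draw on any high-level flow pattern regardless of spatial proximity, which is the property the abstract emphasizes.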
