Head-wise Adaptive Rotary Positional Encoding for Fine-Grained Image Generation
Abstract
Transformers rely on explicit positional encoding to model structure in data. While Rotary Position Embedding (RoPE) excels in 1D domains, its application to image generation reveals significant limitations in fine-grained spatial relation modeling, color cues, and object counting. This paper identifies key limitations of standard multi-dimensional RoPE—rigid frequency allocation, axis-wise independence, and uniform head treatment—in capturing the complex structural biases required for fine-grained image generation. We propose HARoPE, a head-wise adaptive extension that inserts a learnable linear transformation, parameterized via singular value decomposition (SVD), before the rotary mapping. This lightweight modification enables dynamic frequency reallocation, semantic alignment of rotary planes, and head-specific positional receptive fields while rigorously preserving RoPE’s relative-position property. Extensive experiments on class-conditional ImageNet and text-to-image generation (Flux and MMDiT) demonstrate that HARoPE consistently improves performance over strong RoPE baselines and other extensions. The method serves as an effective drop-in replacement, offering a principled and adaptable solution for enhancing positional awareness in transformer-based image generative models.
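The relative-position preservation claimed above can be checked with a small numerical sketch. The idea: if the same linear map W = U diag(s) Vᵀ (an illustrative stand-in for the SVD-parameterized transformation, not the paper's actual implementation) is applied to both queries and keys before the rotary rotation, the attention score still depends only on the positional offset, since (R_m W q)·(R_n W k) = qᵀWᵀR_{n−m}Wk. All names and shapes below are assumptions for illustration.

```python
import numpy as np

def rotary(x, pos, base=10000.0):
    """Standard 1D RoPE: rotate consecutive dim pairs by position-dependent angles."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # (d/2,) frequencies
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
d = 8
# Hypothetical head-wise transform W = U diag(s) V^T:
# U, V orthogonal (via QR), s a learnable positive spectrum.
U, _ = np.linalg.qr(rng.normal(size=(d, d)))
V, _ = np.linalg.qr(rng.normal(size=(d, d)))
s = np.exp(0.1 * rng.normal(size=d))
W = U @ np.diag(s) @ V.T

q, k = rng.normal(size=d), rng.normal(size=d)

def score(m, n):
    """Attention logit for a query at position m and key at position n."""
    return rotary(W @ q, m) @ rotary(W @ k, n)

# Shifting both positions by the same amount leaves the score unchanged,
# so the shared pre-rotary map preserves the relative-position property.
assert np.allclose(score(3, 7), score(13, 17))
```

Because W is shared between queries and keys within a head and sits before the rotation, it can reweight and mix rotary planes per head without breaking translation invariance, which is the property the abstract emphasizes.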