Scaling Parallel Sequence Models to Vision Foundation Models
Abstract
Scaling vision foundation models is constrained by the quadratic complexity of self-attention. Subquadratic alternatives such as linear attention variants and state-space models reduce this complexity, but they typically serialize images into 1D token sequences, compromising spatial coherence and efficiency. Generalized Spatial Propagation Networks (GSPN) offer a linear-time alternative that propagates context directly on the 2D grid via line-scan propagation and removes positional embeddings, yet the original design hits GPU-scaling limits: growing batch sizes and channel counts saturate SM concurrency, serializing scans and spiking latency. We introduce Compact GSPN (C-GSPN), a ViT block that compresses the propagation space to preserve accuracy while cutting propagation latency by nearly 10×. We further improve efficiency with lightweight projections and fused CUDA kernels. To enable large-scale pretraining, we adopt a two-stage cross-operator distillation strategy that combines layer-wise supervision with end-to-end alignment. In a representative 1K configuration (batch 32, C=1152), C-GSPN achieves up to a 2× speedup, maintains competitive zero-shot accuracy, and improves segmentation by +2.1%. Extensive experiments and ablations show that the proposed compression and two-stage distillation are critical for strong transfer while substantially reducing compute, enabling the first extension of a subquadratic operator to foundation-scale (CLIP-style) vision pretraining.
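The linear-time line-scan propagation mentioned above can be illustrated with a minimal sketch. This is a simplified assumption of the mechanism, not the paper's exact formulation: the function name `line_scan_down` and the three-neighbor (left/center/right) weighting are illustrative choices. Each row's hidden state is computed from the row above, so the cost is linear in the number of pixels.

```python
import numpy as np

def line_scan_down(x, w):
    """Top-to-bottom line-scan propagation (illustrative sketch).

    x: (H, W) input feature map.
    w: (H, W, 3) propagation weights for the left, center, and right
       neighbors in the previous row.

    Each row i is computed from row i-1 in a single vectorized step,
    so the total cost is O(H * W) -- linear in the number of pixels.
    """
    H, W = x.shape
    h = np.zeros_like(x, dtype=float)
    h[0] = x[0]  # the first row is initialized from the input alone
    for i in range(1, H):
        prev = np.pad(h[i - 1], 1)  # zero-pad so edge pixels see zeros
        left, center, right = prev[:W], prev[1 : W + 1], prev[2 : W + 2]
        h[i] = (
            x[i]
            + w[i, :, 0] * left
            + w[i, :, 1] * center
            + w[i, :, 2] * right
        )
    return h
```

With all weights zero the scan reduces to the identity, and with a center weight of one it accumulates rows like a cumulative sum; a full 2D variant would run such scans in all four directions and merge them.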