GaussianVision: Vision-Language Alignment from Compressed Image Representations using 2D Gaussian Splatting
Yasmine Omri ⋅ Connor Ding ⋅ Tsachy Weissman ⋅ Thierry Tambe
Abstract
Modern vision–language pipelines are driven by RGB vision encoders trained on massive image–text corpora. While these pipelines have enabled impressive zero-shot capabilities and strong transfer across tasks, they still inherit two structural inefficiencies from the pixel domain: (i) transmitting dense RGB images from edge devices to the cloud is energy-intensive and costly, and (ii) patch-based tokenization explodes sequence length, stressing attention budgets and context limits. We explore 2D Gaussian Splatting (2DGS) as an alternative visual substrate for alignment: a compact, spatially adaptive representation that parameterizes images by a set of colored anisotropic Gaussians. We develop a scalable 2DGS pipeline with structured initialization, luminance-aware pruning, and batched CUDA kernels, achieving over $90\times$ faster fitting and $\sim97$% GPU utilization compared to prior implementations. We further adapt contrastive language-image pre-training (CLIP) to 2DGS by reusing a frozen RGB-based transformer backbone with a lightweight splat-aware input stem and a perceiver resampler, training only $\sim9.7$–$13.8$% of the total parameters. On a 12.8M-pair dataset from DataComp, our 2DGS encoders yield competitive zero-shot performance on 38 datasets from the CLIP benchmark while compressing inputs $3$–$23.5\times$ relative to pixels. Our results establish 2DGS as a viable multimodal substrate, pinpoint architectural bottlenecks, and open a path toward representations that are both semantically powerful and transmission-efficient for edge–cloud learning.
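To make the 2DGS parameterization concrete, here is a minimal dense-rendering sketch. It assumes each splat carries a 2D mean, per-axis scales, a rotation angle, an RGB color, and an opacity, and accumulates Gaussian-weighted colors over the pixel grid; the parameter names and the additive blending are illustrative assumptions, not the paper's exact formulation (which uses batched CUDA kernels rather than this dense NumPy loop).

```python
import numpy as np

def render_2dgs(means, scales, thetas, colors, opacities, H, W):
    """Render an H x W RGB image by accumulating colored anisotropic
    2D Gaussians ('splats'). Dense reference sketch only: production
    pipelines use batched CUDA kernels with spatial culling."""
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs, ys], axis=-1).astype(np.float32)   # (H, W, 2)
    img = np.zeros((H, W, 3), np.float32)

    for mu, (sx, sy), th, c, a in zip(means, scales, thetas, colors, opacities):
        # Anisotropic covariance from a rotation angle and per-axis scales.
        R = np.array([[np.cos(th), -np.sin(th)],
                      [np.sin(th),  np.cos(th)]], np.float32)
        cov = R @ np.diag([sx**2, sy**2]).astype(np.float32) @ R.T
        d = pix - mu                                        # (H, W, 2)
        m = np.einsum('hwi,ij,hwj->hw', d, np.linalg.inv(cov), d)
        w = a * np.exp(-0.5 * m)                            # opacity-scaled falloff
        img += w[..., None] * c                             # additive accumulation
    return np.clip(img, 0.0, 1.0)

# Example: composite 200 random splats into a 64x64 image.
rng = np.random.default_rng(0)
N = 200
img = render_2dgs(
    means=rng.uniform(0, 64, (N, 2)).astype(np.float32),
    scales=rng.uniform(1, 6, (N, 2)).astype(np.float32),
    thetas=rng.uniform(0, np.pi, N).astype(np.float32),
    colors=rng.uniform(0, 1, (N, 3)).astype(np.float32),
    opacities=rng.uniform(0.05, 0.3, N).astype(np.float32),
    H=64, W=64,
)
```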
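The luminance-aware pruning mentioned above can likewise be sketched as scoring each splat by an opacity-weighted luma and keeping only the highest-scoring fraction. The function name, the Rec. 601 luma weights, and the top-k criterion here are assumptions for illustration; the paper's exact pruning rule may differ.

```python
import numpy as np

def prune_by_luminance(colors, opacities, keep_frac=0.8):
    """Keep the top keep_frac of splats ranked by opacity-weighted luma.
    Sketch of a luminance-aware pruning criterion, not the paper's exact rule."""
    luma = colors @ np.array([0.299, 0.587, 0.114], np.float32)  # Rec. 601 weights
    score = opacities * luma
    k = max(1, int(keep_frac * len(score)))
    return np.argsort(score)[-k:]          # indices of the splats to retain
```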