Hyperbolic Gramian Volumes for Multimodal Alignment
Abstract
Multimodal contrastive learning typically relies on pairwise similarities for alignment, but recent work has shown that Gramian volumes can capture higher-order correlations across modalities. However, Euclidean Gramian volumes suffer from volume collapse under L2 normalization: they concentrate near unity and retain little discriminative variance. The exponential volume growth of hyperbolic geometry naturally addresses this by preserving variance, which motivates us to extend Gramian alignment to hyperbolic space. Yet preliminary experiments reveal that pure hyperbolic geometry is insufficient on its own: although it preserves variance, it underperforms Euclidean baselines on cross-category discrimination. We introduce HyperGRAM, a hybrid-geometry framework that combines Euclidean discriminative stability with hyperbolic semantic variance through learnable mixing. Using the numerically stable Lorentz model, HyperGRAM lets volumes serve dual roles: discriminating matched from mismatched triplets, and preserving semantic sensitivity within matched pairs that reflects interpretation spaces (the sets of valid multimodal realizations). Evaluation on four video-text benchmarks demonstrates that hybrid geometry consistently outperforms both pure Euclidean and pure hyperbolic variants, achieving significant zero-shot improvements, while semantic sensitivity exhibits contrasting correlation patterns across datasets.
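To make the Euclidean quantity above concrete, the following minimal sketch computes a Gramian volume under the standard definition, the square root of the determinant of the Gram matrix of L2-normalized embeddings. It is an illustration only, not the HyperGRAM implementation: the function name `gramian_volume`, the 512-dimensional random inputs, and the seed are assumptions introduced here.

```python
import numpy as np


def gramian_volume(vectors: np.ndarray) -> float:
    """Volume of the parallelotope spanned by the L2-normalized rows of `vectors`.

    Hypothetical helper for illustration; assumes no row is the zero vector.
    """
    # L2-normalize each embedding (one row per modality).
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms
    # Gram matrix of pairwise inner products (cosine similarities).
    gram = unit @ unit.T
    # Volume is sqrt(det(G)); clamp tiny negative values caused by round-off.
    det = np.linalg.det(gram)
    return float(np.sqrt(max(det, 0.0)))


# Assumed toy input: three random 512-dimensional embeddings standing in
# for, e.g., video, audio, and text features of an unrelated triplet.
rng = np.random.default_rng(0)
random_triplet = rng.standard_normal((3, 512))
random_volume = gramian_volume(random_triplet)

# Three identical embeddings span a degenerate parallelotope.
aligned_volume = gramian_volume(np.ones((3, 512)))
```

Random high-dimensional unit vectors are nearly orthogonal, so `random_volume` typically lands close to 1, while perfectly aligned embeddings give a volume of 0. This is one way to see why volumes of normalized embeddings tend to cluster near unity.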