Structural Graph Probing of Vision–Language Models
Abstract
The internal organization of vision–language models (VLMs) remains poorly understood, particularly how they distribute and fuse information across layers. We take a topology-first perspective and analyze VLMs through the interaction graphs induced by neuron–neuron correlations, treating each layer as a structured computational network rather than a sequence of token transformations. Operating solely on these graphs, we show that global connectivity patterns are strongly predictive of model behavior across grounded reasoning, counting, and hallucination tasks. Modality-separated graphs reveal that cross-modal fusion strengthens sharply in mid-to-late layers, while contrastive graph alignment exposes how multimodal training reorganizes topology relative to text-only backbones. Targeted interventions on high-degree neurons further demonstrate their causal influence, indicating that VLMs route multimodal reasoning through sparse but structurally critical hubs. These results highlight interaction topology as a powerful, model-agnostic lens for interpreting and comparing multimodal transformers.
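To make the graph-probing idea concrete, the following is a minimal sketch (not the paper's exact pipeline) of how a per-layer neuron–neuron correlation graph could be built from cached activations and how candidate high-degree "hub" neurons could be ranked. The activation shape, correlation threshold, and function names here are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch: build a neuron-neuron correlation graph for one layer
# and rank candidate hub neurons by degree. Shapes, the threshold, and
# function names are illustrative assumptions.
import numpy as np
import networkx as nx


def build_interaction_graph(activations: np.ndarray, threshold: float = 0.5) -> nx.Graph:
    """activations: [num_samples, num_neurons] matrix for a single layer."""
    # Pearson correlation between every pair of neurons.
    corr = np.corrcoef(activations, rowvar=False)
    np.fill_diagonal(corr, 0.0)  # drop self-correlations
    # Keep only strong interactions (absolute correlation above threshold).
    adj = (np.abs(corr) >= threshold).astype(int)
    return nx.from_numpy_array(adj)


def top_hub_neurons(graph: nx.Graph, k: int = 10) -> list[int]:
    """Return indices of the k highest-degree neurons (candidate hubs)."""
    degrees = dict(graph.degree())
    return sorted(degrees, key=degrees.get, reverse=True)[:k]


if __name__ == "__main__":
    # Synthetic activations standing in for one transformer layer.
    rng = np.random.default_rng(0)
    acts = rng.standard_normal((512, 256))  # 512 samples, 256 neurons
    g = build_interaction_graph(acts, threshold=0.6)
    print("edges:", g.number_of_edges(), "hubs:", top_hub_neurons(g, k=5))
```

Under these assumptions, the same construction could be repeated per layer (and per modality, by restricting which neurons or samples enter the correlation) to obtain the layer-wise and modality-separated graphs the abstract refers to.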