From Attraction to Equilibrium: Physics-Inspired Semantic Gravitons for Zero-Shot Anomaly Detection
Abstract
Zero-shot anomaly detection (ZSAD) aims to detect unseen anomalies without any supervision from abnormal samples, which is crucial in open-world scenarios where anomalies are diverse and unpredictable. Because normal and abnormal concepts can be expressed in natural language, recent vision–language models such as CLIP enable anomaly reasoning in a shared visual–textual embedding space. However, existing approaches rely on coarse prompt fusion, which leads to unstable alignment and inaccurate localization under domain shift. To address these limitations, we propose the Semantic Graviton Network (SGNet), a physics-inspired framework that models multimodal alignment as an adaptive potential field. We introduce semantic gravitons, learnable dynamic mediators that bridge the visual and textual modalities by establishing localized semantic equilibria through attraction and equilibrium forces. Within this framework, a graviton interaction network alternates between text-to-graviton and vision-to-graviton coupling, progressively refining multimodal correspondence and promoting structured semantic binding. Furthermore, an energy-based potential regularization built from these two forces constrains the evolution of the interactions, ensuring stability and interpretability in the learned representations. Extensive experiments on ten industrial and medical benchmarks demonstrate that SGNet achieves state-of-the-art zero-shot anomaly detection performance.
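The alternating coupling described above can be illustrated with a minimal sketch. The abstract does not specify how the coupling is computed, so this example assumes, purely for illustration, that each coupling step is a single-head cross-attention update in which the gravitons act as queries. The names `couple` and `graviton_interaction`, the residual update, the number of rounds, and all tensor sizes are hypothetical and are not taken from SGNet.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def couple(gravitons, source):
    """One assumed coupling step: gravitons (queries) attend to source tokens.

    gravitons: (G, D) array, source: (N, D) array. Returns an updated (G, D) array.
    This is plain scaled dot-product cross-attention, not the paper's exact operator.
    """
    d = gravitons.shape[-1]
    attn = softmax(gravitons @ source.T / np.sqrt(d), axis=-1)  # (G, N) weights
    return gravitons + attn @ source  # residual update (an assumption)

def graviton_interaction(gravitons, text_tokens, patch_tokens, num_rounds=2):
    # Alternate text-to-graviton and vision-to-graviton coupling.
    for _ in range(num_rounds):
        gravitons = couple(gravitons, text_tokens)   # text -> graviton
        gravitons = couple(gravitons, patch_tokens)  # vision -> graviton
    return gravitons

# Toy inputs standing in for CLIP text and patch embeddings (random, hypothetical sizes).
rng = np.random.default_rng(0)
text_tokens = rng.normal(size=(2, 8))    # e.g. "normal" and "abnormal" prompts
patch_tokens = rng.normal(size=(16, 8))  # e.g. a 4x4 grid of image patches
gravitons = rng.normal(size=(4, 8))      # 4 graviton vectors

refined = graviton_interaction(gravitons, text_tokens, patch_tokens)
print(refined.shape)
```

In this sketch the gravitons are the only state that changes between rounds, so they accumulate information from both modalities in turn. The energy-based regularization is omitted because its functional form is not given in the abstract.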