AG-VAS: Anchor-Guided Zero-Shot Visual Anomaly Segmentation with Large Multimodal Models
Abstract
Large multimodal models (LMMs) exhibit strong task-generalization capabilities, offering new opportunities for zero-shot visual anomaly segmentation (ZSAS). However, existing LMM-based segmentation still faces fundamental limitations: anomaly semantics are scarce and unstructured, and the weak alignment between textual prompts and visual features makes accurate anomaly localization difficult. To address these challenges, we present AG-VAS (Anchor-Guided Visual Anomaly Segmentation), a new framework that expands the LMM vocabulary with three learnable semantic anchors—[SEG], [NOR], and [ANO]—and introduces a unified anchor-guided segmentation paradigm. Specifically, [SEG] functions as an absolute semantic anchor that injects pixel-level structural priors into LMMs, while [NOR] and [ANO] serve as relative semantic anchors that encode the contrastive semantics between normality and abnormality across categories. To further enhance alignment, we introduce a Semantic-Pixel Alignment Module (SPAM) that bridges the gap between the LMM semantic space and high-resolution visual features, and design an Anchor-Guided Mask Decoder (AGMD) that performs anchor-consistent querying for precise anomaly localization. In addition, we construct Anomaly-Instruct20K, a large-scale instruction dataset that provides structured anomaly knowledge—including appearance, shape, and spatial attributes—to help LMMs effectively learn and integrate the proposed semantic anchors. Extensive experiments on six industrial and medical benchmarks demonstrate that AG-VAS achieves consistent state-of-the-art performance in the zero-shot setting. Code will be released upon acceptance.
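To make the anchor mechanism concrete, the sketch below shows how three learnable anchor tokens could be registered in an LMM's vocabulary, with the embedding matrix resized to match. This is an illustrative assumption, not the AG-VAS implementation: GPT-2 stands in for the actual base LMM, and the routing of anchor hidden states to SPAM and AGMD is only indicated in comments.

```python
# Minimal sketch (assumed, not the authors' code): registering the three
# semantic anchors as special tokens in a Hugging Face causal LM.
from transformers import AutoModelForCausalLM, AutoTokenizer

ANCHOR_TOKENS = ["[SEG]", "[NOR]", "[ANO]"]  # absolute + relative semantic anchors

# GPT-2 is a placeholder; the paper's base LMM is not specified here.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Register the anchors as special tokens so the subword tokenizer never
# splits them, then grow the embedding matrix to cover the new rows.
num_added = tokenizer.add_special_tokens({"additional_special_tokens": ANCHOR_TOKENS})
model.resize_token_embeddings(len(tokenizer))

# The new embedding rows are the learnable anchor embeddings; during
# fine-tuning, the hidden states at these token positions would be fed
# to the segmentation side (SPAM alignment, then the AGMD decoder).
anchor_ids = tokenizer.convert_tokens_to_ids(ANCHOR_TOKENS)
print(f"added {num_added} anchor tokens with ids {anchor_ids}")
```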