UNI-OOD: Unified Object- and Image-level Out-of-Distribution Detection via Cross-Context Attentive Vision-Language Modeling
Abstract
Out-of-distribution (OOD) detection is a key requirement for reliable deployment in open-world environments, where a model must recognize inputs that fall outside the semantic scope of known concepts. While recent advances in vision–language models (VLMs) have achieved strong results in image-level OOD detection, most methods still assume that each image contains a single dominant object. This assumption severely limits their applicability to real-world settings, where scenes are naturally composed of multiple objects that each demand an independent OOD assessment. Existing object-level approaches, including the current state-of-the-art (SOTA) method RUNA, remain constrained by coarse global representations and insufficient modeling of contextual dependencies between objects and their backgrounds. We propose UNI-OOD, a unified framework that performs both object- and image-level OOD detection within a single vision–language model, without requiring prior knowledge of which task is being addressed at inference time. The key idea is to leverage cross-context attentive modeling that captures complementary visual and textual semantics. UNI-OOD learns to attend to fine-grained spatial details within each object, aligns visual and linguistic embeddings to strengthen semantic correspondence, and models interactions between target objects and their surrounding context. By jointly reasoning over object-centric and background cues, the framework disentangles informative visual evidence from spurious correlations and enables a consistent OOD scoring mechanism across different visual granularities. Extensive experiments on standard object- and image-level benchmarks demonstrate that UNI-OOD achieves substantial and consistent improvements over previous approaches, establishing new SOTA performance in both object-level and image-level OOD detection.
Beyond empirical gains, this study provides the first holistic formulation of OOD detection that bridges the gap between object- and image-level detection within a single unified vision–language paradigm, establishing a general foundation for open-world applications.
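To make the notion of a single scoring rule across visual granularities concrete, the sketch below shows one common way VLM-based OOD scores are computed: the cosine similarity between a visual embedding (image-level or cropped object-level) and a set of class-text embeddings, turned into a max-softmax confidence. This is only an illustrative assumption in the style of prior VLM OOD scores, not the UNI-OOD method itself; the embedding dimensions, the temperature `tau`, and the toy vectors are hypothetical placeholders.

```python
# Hedged sketch: a granularity-agnostic, MCM-style OOD score.
# The same function scores an image-level embedding or an
# object-level (cropped-region) embedding against known-class
# text embeddings. All inputs here are toy placeholders.
import numpy as np

def ood_score(visual_embedding, class_text_embeddings, tau=0.1):
    """Return the max softmax probability over known classes.
    Higher score = more in-distribution; low score flags OOD."""
    v = visual_embedding / np.linalg.norm(visual_embedding)
    T = class_text_embeddings / np.linalg.norm(
        class_text_embeddings, axis=1, keepdims=True)
    sims = T @ v                       # cosine similarity per known class
    logits = sims / tau                # temperature-scaled similarities
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return probs.max()

# Deterministic toy check with 5 orthogonal "class text" embeddings:
# an embedding near class 0 scores high; a vector equally similar to
# every class (maximally ambiguous) scores exactly 1/5 = 0.2.
texts = np.eye(5)
id_vec = np.array([1.0, 0.1, 0.0, 0.0, 0.0])   # near class 0
ood_vec = np.ones(5)                            # equidistant from all classes
print(ood_score(id_vec, texts))   # close to 1.0
print(ood_score(ood_vec, texts))  # exactly 0.2
```

Because the score depends only on an embedding and the shared text prompts, applying it per detected object or once per image yields the kind of consistent object- and image-level scoring the abstract describes.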