Free-Grained Hierarchical Visual Recognition
Abstract
Hierarchical image recognition predicts labels across a semantic taxonomy, but existing methods typically assume complete, fine-grained labels, an assumption rarely met in practice. Real-world annotations vary in granularity due to image quality, annotator expertise, and task goals; a distant bird may be labeled "Bird", while a close-up reveals "Bank Swallow". We formalize this realistic setting as free-grain learning, where each image may be labeled at any level of the taxonomy, while the model must still learn the full hierarchical path. To study this problem, we build diverse benchmarks that provide labels at varying semantic granularity, including a new three-level ImageNet-F and mixed-granularity variants of existing datasets. We further develop strong baselines that improve learning under mixed supervision through (1) semantic guidance from vision–language models and (2) visual guidance via semi-supervised learning. Together, our benchmarks and methods advance hierarchical recognition under real-world constraints.