Hier-COS: Making Deep Features Hierarchy-aware via Composition of Orthogonal Subspaces
Abstract
Traditional classifiers treat all class labels as mutually independent, thereby considering all negative classes to be equally incorrect. This approach fails severely in many real-world scenarios, where a known semantic hierarchy defines a partial order of preferences over negative classes. While hierarchy-aware feature representations have shown promise in mitigating this problem, their performance is typically assessed using metrics like Mistake Severity (MS) and Average Hierarchical Distance (AHD). In this paper, we highlight important shortcomings in existing hierarchical evaluation metrics, demonstrating that they are often incapable of measuring true hierarchical performance. Our analysis reveals that existing methods learn sub-optimal hierarchical representations, despite competitive MS and AHD scores. To counter these issues, we introduce Hierarchical Composition of Orthogonal Subspaces (Hier-COS), a novel framework for unified 'hierarchy-aware fine-grained' and 'hierarchical multi-label' classification. We show that Hier-COS is theoretically guaranteed to be consistent with the given hierarchy tree. Furthermore, our framework implicitly adapts the learning capacity for different classes based on their position within the hierarchy tree — a vital property absent in existing methods. Finally, to address the limitations of evaluation metrics, we propose Hierarchically Ordered Preference Score (HOPS), a ranking-based metric that demonstrably overcomes the deficiencies of current evaluation standards. We benchmark Hier-COS on four challenging datasets, including the deep and imbalanced tieredImageNet-H (12-level) and iNaturalist-19 (7-level). Through extensive experiments, we demonstrate that Hier-COS achieves state-of-the-art performance across all hierarchical metrics for every dataset, while simultaneously beating the top-1 accuracy in all but one case. 
Lastly, we show that Hier-COS can effectively learn to transform frozen features extracted from a pretrained vision transformer (ViT) backbone into hierarchy-aware representations, substantially improving hierarchical classification performance.