KαLOS finds Consensus: A Meta-Algorithm for Evaluating Inter-Annotator Agreement in Complex Vision Tasks
David Tschirschwitz ⋅ Volker Rodehorst
Abstract
Progress in computer vision relies on the interplay of data, algorithms, and computation. For foundational tasks such as object detection, supervised learning with human-annotated data remains the state-of-the-art approach. However, this "gold-standard" data is notoriously error-prone, a fundamental bottleneck that hinders both model training and evaluation. As a result, benchmarking improvements have become negligible or non-existent in the last year. This issue does not stem from algorithms or computation, but from problem specifications and the dataset creation process, which ultimately lead to ill-defined tasks with noisy labels. Although statistical methods for Inter-Annotator Agreement (IAA) exist, they are often applied inconsistently and lack standardization, which makes dataset quality comparisons unreliable.

We propose a unified meta-algorithm for dataset quality evaluation called K$\alpha$LOS (Krippendorff's $\alpha$ Localization Object Sensing) that serves as a tool both during dataset creation and for final assessment. Our framework conceptually incorporates existing methods and extends them. This gives it a broader scope, as it applies to any combined localization and classification task, and greater analytical depth than competing methods, enabling downstream analyses such as evaluating intra-annotator consistency, rater vitality, and localization sensitivity. Crucially, it is modular, flexible, and extensible, allowing components to be interchanged for specific use cases and enabling comparability across datasets and tasks.

Validating such a metric is challenging, as no "real" ground truth exists: what we evaluate is typically considered the ground truth and the starting point of the modeling process. Prior validation often relies on heuristics or machine-generated labels that fail to capture the complexity of real annotation noise. We therefore introduce an experimental validation approach using an empirical noise generator built from real, multi-annotated datasets, which also scrutinizes heuristic assumptions about the noise distribution.
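The classification side of K$\alpha$LOS builds on Krippendorff's $\alpha$, a chance-corrected agreement coefficient that accommodates any number of annotators and missing ratings. As a point of reference, the sketch below computes the standard nominal-data form of the coefficient from per-object label lists; the function name `krippendorff_alpha_nominal` and the toy ratings are illustrative only, and the localization-sensing part of the full method (matching annotators' detections before comparing labels) is not shown here.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` maps each unit (e.g. a detected object) to the list of labels
    it received; annotators who skipped a unit are simply absent.
    """
    coincidence = Counter()              # o_ck: label-pair coincidence matrix
    for labels in units.values():
        m = len(labels)
        if m < 2:                        # singly-rated units contribute no pairs
            continue
        for a, b in permutations(labels, 2):
            coincidence[(a, b)] += 1.0 / (m - 1)

    n_c = Counter()                      # marginal label frequencies
    for (a, _), w in coincidence.items():
        n_c[a] += w
    n = sum(n_c.values())
    if n <= 1:
        return float("nan")

    # Observed vs. expected disagreement (nominal metric: any mismatch = 1).
    d_o = sum(w for (a, b), w in coincidence.items() if a != b) / n
    d_e = sum(n_c[a] * n_c[b] for a in n_c for b in n_c if a != b) / (n * (n - 1))
    return 1.0 if d_e == 0 else 1.0 - d_o / d_e

# Example: three annotators labeling four matched objects.
ratings = {
    "obj1": ["car", "car", "car"],
    "obj2": ["car", "truck", "car"],
    "obj3": ["truck", "truck"],          # one annotator skipped this object
    "obj4": ["bus", "car", "bus"],
}
print(krippendorff_alpha_nominal(ratings))
```

In this form the coefficient ranges from 1 (perfect agreement) down through 0 (chance-level agreement) to negative values for systematic disagreement, which is what makes it usable as a dataset-quality signal across differently sized annotator pools.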