Poster
Unbiasing through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks
Nina Shvetsova · Arsha Nagrani · Bernt Schiele · Hilde Kuehne · Christian Rupprecht
We propose a new "Unbiased through Textual Description (UTD)" video benchmark based on unbiased subsets of existing video classification and retrieval datasets to enable a more robust assessment of video understanding capabilities. Namely, we tackle the problem that current video benchmarks may suffer from different representation biases, e.g., object bias or single-frame bias, where mere recognition of objects or utilization of only a single frame is sufficient for correct prediction. We leverage VLMs and LLMs to analyze and debias video benchmarks from such representation biases. Specifically, we generate frame-wise textual descriptions of videos, filter them for specific information (e.g., only objects), and leverage them to examine representation biases across three dimensions: 1) concept bias: determining whether a specific concept (e.g., objects) alone suffices for prediction; 2) temporal bias: assessing whether temporal information contributes to prediction; and 3) common sense vs. dataset bias: evaluating whether zero-shot reasoning or dataset-specific correlations contribute to prediction. Since our new toolkit allows us to analyze representation biases at scale without additional human annotation, we conduct a systematic and comprehensive analysis of representation biases in 12 popular video classification and retrieval datasets and create new object-debiased test splits for these datasets. Moreover, we benchmark 33 state-of-the-art video models on the original and debiased splits and analyze biases in the models. To facilitate the future development of more robust video understanding benchmarks and models, we release "UTD-descriptions", a dataset with our rich structured descriptions for each dataset, and "UTD-splits", a dataset of object-debiased test splits.
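As a rough illustration of the probing idea described above (not the authors' released implementation), the sketch below shows how concept bias and temporal bias might be checked for a single video sample. The helpers caption_frame, filter_to_concept, and predict_from_text are hypothetical stand-ins for the VLM and LLM calls; the actual models and prompts are assumptions.

```python
# Minimal sketch of a UTD-style bias probe for one video.
# All model calls (VLM captioning, LLM filtering/prediction) are hypothetical
# stubs and must be replaced with real models and prompts.

from typing import List


def caption_frame(frame) -> str:
    """Hypothetical VLM call: return a free-form textual description of one frame."""
    raise NotImplementedError("plug in a vision-language captioning model here")


def filter_to_concept(description: str, concept: str) -> str:
    """Hypothetical LLM call: keep only the requested concept (e.g. 'objects')."""
    raise NotImplementedError("plug in an LLM with a concept-filtering prompt here")


def predict_from_text(descriptions: List[str], label_space: List[str]) -> str:
    """Hypothetical LLM call: zero-shot classify the video from text alone."""
    raise NotImplementedError("plug in an LLM classifier here")


def concept_bias_probe(frames, label: str, label_space: List[str],
                       concept: str = "objects") -> bool:
    """True if concept-only descriptions already suffice for a correct prediction,
    i.e. the sample is biased w.r.t. that concept and a candidate for exclusion
    from the debiased split."""
    descriptions = [caption_frame(f) for f in frames]                     # frame-wise descriptions
    concept_only = [filter_to_concept(d, concept) for d in descriptions]  # e.g. object lists only
    return predict_from_text(concept_only, label_space) == label


def temporal_bias_probe(frames, label: str, label_space: List[str]) -> bool:
    """True if a single (middle) frame description already suffices,
    i.e. temporal information is not needed for this sample."""
    middle_frame = frames[len(frames) // 2]
    return predict_from_text([caption_frame(middle_frame)], label_space) == label
```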