VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition
Tanush Yadav ⋅ Reza Salehi ⋅ Jae Sung Park ⋅ Vivek Ramanujan ⋅ Hannaneh Hajishirzi ⋅ Yejin Choi ⋅ Ali Farhadi ⋅ Rohun Tripathi ⋅ Ranjay Krishna
Abstract
Videos capture a rich array of subtleties in actions. While large video language models have advanced in understanding long videos, their ability to discern nuanced motions in domain-specific, fine-grained actions remains unclear. Current benchmarks evaluate fine-grained actions in a domain-agnostic manner, making it hard to evaluate models on this task. To address this gap, we introduce VideoNet, a comprehensive benchmark aimed at evaluating the domain-specific, fine-grained action understanding of video models. This benchmark covers $1,087$ distinct actions spanning $38$ domains, from bouldering to suturing. Our evaluations demonstrate that current video models encounter significant difficulties in recognizing these actions in a zero-shot setting. We then examine how to improve model performance on this task. To this end, we collect a training dataset of 160K clips of fine-grained, domain-specific actions. Post-training a 4B model on this data, we surpass all Gemini models and GPT-4o on our benchmark. Next, we evaluate models in a few-shot setting and demonstrate that even the best-performing model, GPT-5, struggles. When given three in-context examples, the gap between model and human performance widens: human accuracy improves by 13% while model accuracy improves by only 3%. This suggests that video language models are not yet effective few-shot learners, unlike their text-only counterparts, and that further gains may be elicited by improving these models' few-shot learning capabilities.